Developing and Analyzing a Spanish Corpus for Forensic Purposes

Ángela Almela, Gema Alcaraz-Mármol, Arancha García-Pinar, Clara Pallejá


This paper presents the methods for developing a database of Spanish writing that can be used for forensic linguistic research, including our data collection procedures. Specifically, the main data collection instrument has been adapted from Chaski (2001) and translated into Spanish. It consists of ten tasks in which subjects are asked to write formal and informal texts on different topics. To date, 93 undergraduates from Spanish universities and prisoners convicted of gender-based abuse have participated in the study. The data collected have been analyzed from two perspectives: semantic and morphosyntactic. For the semantic analysis, psycholinguistic categories have been used, many of them taken from the LIWC dictionary (Pennebaker et al., 2001). To obtain a more comprehensive depiction of the linguistic data, additional ad hoc categories have been created from the corpus itself and validated with a double-check method to ensure inter-rater reliability. For the morphosyntactic analysis, the natural language processing tool ALIAS TATTLER is being developed for Spanish. Results show that it is possible to differentiate non-abusers from abusers with high accuracy based on linguistic features.
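
As a rough illustration of the dictionary-based semantic analysis mentioned above, the following minimal sketch counts how often words from each category occur in a text, expressed as a percentage of all tokens. It is not the authors' implementation: the category names, the word lists, and the function names `compile_patterns` and `category_rates` are all hypothetical, and the entries are not taken from the LIWC dictionary or from the corpus.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> entries.
# A trailing * matches any suffix. Illustrative only; these entries
# do not come from LIWC or from the study's ad hoc categories.
CATEGORIES = {
    "negemo": ["odi*", "trist*", "miedo"],
    "self": ["yo", "me", "mi", "mío"],
}


def compile_patterns(categories):
    """Build one whole-token regular expression per category."""
    patterns = {}
    for name, entries in categories.items():
        alternatives = []
        for entry in entries:
            if entry.endswith("*"):
                # Stem entry: match the stem followed by any word characters.
                alternatives.append(re.escape(entry[:-1]) + r"\w*")
            else:
                alternatives.append(re.escape(entry))
        patterns[name] = re.compile(r"^(?:" + "|".join(alternatives) + r")$")
    return patterns


def category_rates(text, categories=CATEGORIES):
    """Return, per category, matched tokens as a percentage of all tokens."""
    tokens = re.findall(r"\w+", text.lower())
    patterns = compile_patterns(categories)
    counts = Counter()
    for token in tokens:
        for name, pattern in patterns.items():
            if pattern.match(token):
                counts[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {name: 100.0 * counts[name] / total for name in categories}


if __name__ == "__main__":
    print(category_rates("Yo odio la tristeza y me da miedo"))
```

Rates are normalized by text length so that texts of different sizes can be compared; a real analysis would need a validated dictionary and proper tokenization for Spanish.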


forensic linguistics; linguistic corpus; morphosyntactic analysis; semantics


Almela, A., Alcaraz-Mármol, G. and Cantos, P. (2015). Analysing deception in a psychopath's speech: A quantitative approach. DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada, 31(2), 559-572.

Almela, A., Valencia-García, R. and Cantos, P. (2013). Seeing through Deception: A Computational Approach to Deceit Detection in Spanish Written Communication. Linguistic Evidence in Security, Law and Intelligence, 1(1), 3-12.

Baker, P. (2012). Acceptable bias? Using corpus linguistics methods with critical discourse analysis. Critical Discourse Studies, 9(3), 247-256.

Cantos Gomez, P. (2013). Statistical Methods in Language and Linguistic Research. Sheffield, UK: Equinox Publishing Ltd.

Chaski, C.E. (2001). Empirical Evaluations of Language-based Author Identification Techniques. International Journal of Speech, Language and Law (previously Forensic Linguistics), 8(1), 1-66.

Chaski, C.E. (2005). Who's at the keyboard? Authorship Attribution in Digital Evidence Investigations. International Journal of Digital Evidence, Spring 2005.

Chaski, C.E. (2007). The Keyboard Dilemma and Author Identification. In S. Shenoi and P. Craiger (Eds.), Advances in Digital Forensics III. New York: Springer.

Chaski, C.E. (2012). Best Practices and Admissibility of Forensic Author Identification. Journal of Law and Policy, 21(2). Brooklyn Law School.

Coulthard, M. (1994). On the use of corpora in the analysis of forensic texts. International Journal of Speech, Language and Law (previously Forensic Linguistics), 1(1), 27-43.

Eagleson, R. (1994). Forensic analysis of personal written texts: a case study. In J. Gibbons (Ed.), Language and the Law. London: Longman.

Fornaciari, T. and Poesio, M. (2012). Sincere and deceptive statements in Italian criminal proceedings. In Proceedings of the International Association of Forensic Linguists Tenth Biennial Conference (pp. 126–138).

Guillén, V., Vargas, C., Pardiño, M., Martínez, P. and Suárez, A. (2008). Exploring State-of-the-art Software for Forensic Authorship Identification. International Journal of English Studies, 8(1), 1-28.

Hancock, J.T., Woodworth, M.T. and Porter, S. (2011). Hungry like the wolf: A word-pattern analysis of the language of psychopaths. Legal and Criminological Psychology, 18(1), 1-13.

Johnson, S.A. (2006). Physical Abusers and Sexual Offenders: Forensic and Clinical Strategies. New York: Taylor and Francis.

Juola, P. (2006). Authorship attribution. Foundations and Trends in Information Retrieval, 1(3), 233-334.

Kennedy, G. (1998). An Introduction to Corpus Linguistics. London/New York: Longman.

Kniffka, H. (2000). Anonymous Authorship Analysis without Comparison Data? A Case Study with methodological impact. Linguistische Berichte, 182, 179-198.

Koppel, M., Schler, J. and Argamon, S. (2009). Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1), 9-26.

Leech, G. (2005). Adding Linguistic Annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow Books.

Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 105-122). Berlin: Mouton De Gruyter.

McEnery, T. (2003). Corpus Linguistics. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

McEnery, T. and Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge Textbooks in Linguistics. Cambridge: Cambridge University Press.

Newman, M. L., Pennebaker, J. W., Berry, D. S. and Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29, 665-675.

Parodi, G. (2008). Lingüística de corpus: Una introducción al ámbito. Revista de Lingüística Teórica y Aplicada, 46(1), 93-119.

Pennebaker, J. W., Francis, M. E. and Booth, R. J. (2001). Linguistic Inquiry and Word Count. Mahwah (NJ): Erlbaum Publishers.

Renouf, A. (1987). Corpus Development. In J. M. Sinclair (Ed.), Looking Up. Glasgow/London: Harper Collins Publishers.

Saldanha, G. (2009). Principles of corpus linguistics and their application to translation studies research. Tradumàtica, 7, 1-7.

Shapero, J. J. (2011). The Language of Suicide Notes. Unpublished Thesis. University of Birmingham.

Stone, P.J., Bales, R.F., Namenwirth, J.Z., and Ogilvie, D.M. (1962). The general inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information. Journal of the Society for General Systems Research, October 1962.

Stone, P.J., Dunphy, D., Smith, M.S., and Ogilvie, D.M. (1966). The General Inquirer: a computer approach to content analysis. Cambridge, MA: MIT Press.

Teubert, W. (2005). My version of corpus linguistics. International Journal of Corpus Linguistics, 10(1), 1-13.

Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins.

Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, Massachusetts: Addison-Wesley.


