Selecting text features relevant for authorship attribution
The article examines which text features are relevant for authorship attribution. A review of work in this area shows the effectiveness of such features as character and word bigrams and trigrams, function words, the distribution of words by part of speech, the most frequent words, punctuation marks, word length distribution and sentence length. These are formal and formal-semantic linguistic elements drawn from all subsystems of the language system.
The diagnostic parameters most commonly used in authorship expert review are formal features of the lower subsystems of the language system. Speakers are usually not aware of these features when producing speech. A programmer without a special degree in linguistics can easily extract them, and their statistical patterns of use can be identified on small amounts of text. Elements of the formal-semantic segmentation of speech are used less often as diagnostic parameters. The author of a text may deliberately imitate these parameters. To work with them, the programmer needs pre-tagged texts, special linguistic analysis and a specialised corpus, and identifying their statistical patterns of use may require large amounts of text.
The authors also name features that are diagnostic only in certain types of texts: idiosyncratic features (lexical and grammatical errors, etc.) are not relevant in edited and corrected texts, the use of metadata is applicable only to texts created on a computer, and so on. The authors note that the problem of context-independent text features remains unsolved. Many results in authorship attribution were obtained on texts in English and other Indo-European languages. Whether these features are relevant for Russian texts is a topical question, and identifying the relevant grammatical features is especially necessary.
Russian authorship attribution research has tested its methodology on literary texts. It is now necessary to define the set of context-independent features that are relevant for analysing texts of a particular type. Morphological and syntactic text features can be used effectively only on the basis of linguistic text markup. Russian authorship attribution therefore needs a specialised, linguistically annotated Russian corpus.
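The formal, lower-level features named in the abstract (character n-grams, word length distribution, sentence length) can be computed without any linguistic markup. The following is a minimal sketch of that idea, not the authors' method: the function names, the regular-expression tokenisation and the sentence-splitting rule are all assumptions made for illustration.

```python
import re
from collections import Counter


def char_ngrams(text, n):
    """Relative frequencies of overlapping character n-grams.

    Assumption: the text is lowercased and whitespace runs are collapsed
    to a single space before counting.
    """
    text = re.sub(r"\s+", " ", text.lower()).strip()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def word_length_distribution(text):
    """Share of words of each length (in characters).

    Assumption: a word is any run of Unicode word characters, so the
    same pattern works for Latin and Cyrillic text.
    """
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(len(word) for word in words)
    return {length: count / total for length, count in sorted(counts.items())}


def mean_sentence_length(text):
    """Average number of words per sentence.

    Assumption: sentences end at '.', '!' or '?'; abbreviations and
    ellipses are not handled specially.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    word_counts = [len(re.findall(r"\w+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)


if __name__ == "__main__":
    sample = "Usually the speaker is not aware of these signs. They are easy to count!"
    print(char_ngrams(sample, 3))
    print(word_length_distribution(sample))
    print(mean_sentence_length(sample))
```

Features such as part-of-speech distributions are deliberately left out of the sketch: as the abstract notes, they require pre-tagged texts or a linguistically annotated corpus.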
Keywords
data mining,
computer science,
interdisciplinary methods of research,
stylometry,
authorship expert review,
authorship attribution of text
Authors
Rezanova Zoya I. | Tomsk State University; Tomsk Polytechnic University | resso@rambler.ru; resso@mail.tsu.ru; rezanovazi@mail.ru |
Romanov Alexandr S. | Tomsk State University of Control Systems and Radioelectronics | alexx.romanov@gmail.com |
Meshcheryakov Roman V. | Tomsk State University of Control Systems and Radioelectronics | mrv@ieee.org |