Approach and architecture for categorization and reveal of aggression features in Russian text content | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2021. № 54. DOI: 10.17223/19988605/54/7

Approach and architecture for categorization and reveal of aggression features in Russian text content

Methods and systems for detecting aggressive features in text content find fairly broad application. In particular, they are used to analyze online comments and reviews, to perform sentiment analysis concerning particular events, and to develop digital assistants for moderators of online discussions. This paper discusses aggression features in Russian text. The relevant sources consider various text-based aggression features but lack an overarching classification built on their common foundations. The complex structure of textual content requires dimensionality reduction before analytic methods can be applied: to classify text messages and employ machine learning methods, the text must be vectorized on the basis of certain features. With this in mind, such a classification is established in this paper. All features are divided into five classes: lexical, morphological, statistical, conversational (discourse) and indirect. Lexical features are specific words and set expressions that may denote violence; they are classified as aggressive content. Morphological features concern morphemes and word formation: the creation of neologisms and of derivative words by means of suffixes or prefixes. Statistical features concern the frequency of certain parts of speech and punctuation marks. Conversational (discourse) features are the most difficult to extract, because they involve demagogy, stylistic variation of speech, sarcasm and irony, word distortion and other devices that are hard to formalize and detect. Indirect features are markers of emotional expressiveness that relate mostly to the presentational aspects of the text (masking, letter case, etc.). The features and items of each class are summarized in tabular form.
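The statistical and indirect classes described above lend themselves to a short illustration. The following is a minimal sketch, not the authors' implementation: the function name, the chosen features and the masking pattern are assumptions made only to show how letter case, punctuation frequency and symbol masking could be turned into numbers.

```python
import re

# Hypothetical masking pattern: letters, then one or more masking symbols,
# then letters again (for example a word with a letter replaced by "*").
MASKED_TOKEN = re.compile(r"[^\W\d_]+[*#$]+[^\W\d_]+")


def presentational_features(text: str) -> dict:
    """Compute a few illustrative indirect and statistical features."""
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    length = len(text)
    return {
        # Share of upper-case letters among all letters (letter-case marker).
        "upper_ratio": upper / len(letters) if letters else 0.0,
        # Exclamation marks per character of text (punctuation frequency).
        "exclamation_freq": text.count("!") / length if length else 0.0,
        # Number of tokens that look masked with special symbols.
        "masked_tokens": len(MASKED_TOKEN.findall(text)),
    }


if __name__ == "__main__":
    print(presentational_features("ТЫ д*рак!!!"))
```

Each value is one component of a feature vector; lexical, morphological and conversational features would need dictionaries or linguistic tooling that this sketch does not attempt to model.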
Approaches are proposed to automate the detection of these features; they are based on thesauri, natural language processing libraries and general-purpose software tools. An architecture of a software suite for text message vectorization is developed. It is specified with class diagrams that describe the functional units responsible for transforming, filtering and vectorizing text content. The approach implemented in this architecture generally fits the common framework that includes stop-word removal, tokenization, part-of-speech tagging, stemming and other standard feature extraction steps. Its distinguishing trait is that different sequences of data transformations are established for different kinds of features, after which the results are aggregated. Recognition accuracy is also estimated. The proposed approach makes it possible to estimate aggression features in text content with decent accuracy; estimation errors arise primarily from polysemy, in particular because common words can carry additional negative meanings in certain niche contexts. The limitations of the approach preclude its use for analyzing heterogeneous content or a mix of languages.
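The idea of separate transformation sequences per feature kind, followed by aggregation, can be sketched as follows. This is an assumption-laden illustration rather than the architecture from the paper: the step functions, the tiny stop list, the sample lexicon and the names `PIPELINES` and `vectorize` are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# Hypothetical sample data; a real system would load a stop list and a thesaurus.
STOP_WORDS = {"и", "в", "на", "не"}
AGGRESSIVE_LEXICON = {"убить", "ненавижу"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def lexicon_share(tokens: List[str]) -> float:
    """Share of tokens found in the sample lexicon."""
    hits = sum(1 for t in tokens if t in AGGRESSIVE_LEXICON)
    return hits / len(tokens) if tokens else 0.0


def upper_case_share(text: str) -> float:
    """Share of upper-case letters; must run on the raw, non-lowered text."""
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0


# One transformation sequence per feature kind, listed in output order.
PIPELINES: Dict[str, List[Callable]] = {
    "indirect": [upper_case_share],
    "lexical": [tokenize, remove_stop_words, lexicon_share],
}


def vectorize(text: str) -> List[float]:
    """Run every chain on the same input and aggregate the results."""
    vector = []
    for steps in PIPELINES.values():
        value = text
        for step in steps:
            value = step(value)
        vector.append(float(value))
    return vector


if __name__ == "__main__":
    print(vectorize("Я НЕНАВИЖУ и тебя"))
```

The two chains illustrate why a single shared preprocessing sequence would not suffice: the lexical chain lower-cases the text before the lookup, while the letter-case feature needs the original casing.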

Keywords

text analysis, aggression, emotion mining, sentiment analysis, natural language processing

Authors

Name	Organization	E-mail
Levonevskiy Dmitriy K.	St. Petersburg Federal Research Center of the Russian Academy of Sciences	dlewonewski.8781@gmail.com
Saveliev Anton I.	St. Petersburg Federal Research Center of the Russian Academy of Sciences	saveliev.ais@yandex.ru

