Approach and architecture for categorization and reveal of aggression features in Russian text content
Methods and systems for detecting aggressive features in text content are widely applied. In particular, they are used to analyze online comments and reviews, to perform sentiment analysis concerning particular events, to develop digital assistants for moderators of online discussions, etc. This paper discusses aggression features in Russian text. The relevant sources consider various text-based aggression features but lack an overarching classification built on their common foundations. The complex structure of textual content requires dimensionality reduction before analytic methods can be applied. To classify text messages and apply machine learning methods, the text must be vectorized on the basis of certain features. From this perspective, such a classification is established in this paper. All features are divided into five classes: lexical, morphological, statistical, discourse (conversational) and indirect. Lexical features are specific words and set expressions that may denote violence and are classified as aggressive content. Morphological features relate to morphemes and word formation: the creation of neologisms and of derivative words using suffixes or prefixes. Statistical features relate to the frequency of certain parts of speech and punctuation marks. Discourse features are the most difficult to extract, because they involve demagogy, stylistic variations of speech, sarcasm and irony, word distortion and other devices that are difficult to formalize and detect. Indirect features are markers of emotional expressiveness that relate mostly to the presentational aspects of the text (masking, letter case, etc.). The features and items of each class are summarized in tabular form.
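As an illustration of the statistical and indirect classes described above, the following minimal Python sketch computes three surface features from a message. It is not the authors' implementation: the function name `surface_features`, the chosen features and the masking pattern (letters interrupted by `*` or `#`) are assumptions made only for this example.

```python
import re


def surface_features(text: str) -> dict[str, float]:
    """Compute a few illustrative statistical and indirect features.

    All feature names and definitions here are hypothetical examples,
    not the feature set used in the paper.
    """
    letters = [ch for ch in text if ch.isalpha()]
    tokens = text.split()

    # Indirect feature: share of upper-case letters ("shouting").
    upper_ratio = sum(ch.isupper() for ch in letters) / max(len(letters), 1)

    # Statistical feature: exclamation marks per whitespace-separated token.
    exclamations_per_token = text.count("!") / max(len(tokens), 1)

    # Indirect feature: words masked with '*' or '#' between letters.
    masked_words = len(re.findall(r"\w+[*#]+\w+", text))

    return {
        "upper_ratio": upper_ratio,
        "exclamations_per_token": exclamations_per_token,
        "masked_words": float(masked_words),
    }


if __name__ == "__main__":
    print(surface_features("ТЫ д*рак!!!"))
```

Lexical, morphological and discourse features would need language resources such as thesauri or a morphological analyzer, so they are left out of this sketch.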
Approaches are proposed for automating the detection of these features; they are based on thesauri, natural language processing libraries and general-purpose software tools. An architecture of a software suite for text message vectorization is developed. It is specified with class diagrams that describe the functional units responsible for the transformation, filtering and vectorization of text content. The approach implemented in this architecture generally fits the common framework that includes stop-word removal, tokenization, part-of-speech tagging, stemming and other standard feature extraction techniques. Its distinguishing trait is that different sequences of data transformations are established for different kinds of features, after which the results are aggregated. Recognition accuracy is also estimated. The proposed approach makes it possible to estimate aggression features in text content with decent accuracy; estimation errors arise primarily from polysemy, in particular because common words can have additional negative meanings in certain niche contexts. The limitations of this approach preclude its use for analyzing heterogeneous content or a mix of languages.
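The idea of separate transformation sequences per feature kind, followed by aggregation, can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the architecture from the paper: the tiny lexicon and stop-word list, the step functions and the `PIPELINES` mapping are all hypothetical, and only three of the five feature classes are represented.

```python
import re

# Hypothetical mini-thesaurus and stop-word list, for illustration only.
AGGRESSIVE_LEXICON = {"убью", "ненавижу"}
STOP_WORDS = {"я", "и", "в", "на", "тебя"}


def tokenize(text):
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def count_lexicon_hits(tokens):
    return float(sum(t in AGGRESSIVE_LEXICON for t in tokens))


def exclamation_count(text):
    return float(text.count("!"))


def upper_ratio(text):
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.isupper() for ch in letters) / max(len(letters), 1)


# Each feature class gets its own sequence of transformations.
# Lexical features need normalized tokens; indirect features must see
# the raw text, because lower-casing would destroy the letter-case signal.
PIPELINES = {
    "lexical": [tokenize, lowercase, remove_stop_words, count_lexicon_hits],
    "statistical": [exclamation_count],
    "indirect": [upper_ratio],
}


def run_chain(text, steps):
    """Apply the steps in order, feeding each output into the next step."""
    value = text
    for step in steps:
        value = step(value)
    return value


def vectorize(text):
    """Aggregate the per-class results into one feature vector."""
    return [run_chain(text, steps) for steps in PIPELINES.values()]


if __name__ == "__main__":
    print(vectorize("Я тебя НЕНАВИЖУ!!!"))
```

Keeping the chains separate means a normalization step that helps one class, such as lower-casing for lexicon lookup, cannot erase the signal another class depends on.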
Keywords
text analysis, aggression, emotion mining, sentiment analysis, natural language processing
Authors
Name | Organization | E-mail |
Levonevskiy Dmitriy K. | St. Petersburg Federal Research Center of the Russian Academy of Sciences | dlewonewski.8781@gmail.com |
Saveliev Anton I. | St. Petersburg Federal Research Center of the Russian Academy of Sciences | saveliev.ais@yandex.ru |
References

Approach and architecture for categorization and reveal of aggression features in Russian text content | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2021. № 54. DOI: 10.17223/19988605/54/7