Building a style sheet using text classification algorithms based on decision trees
Drawing on classification algorithms based on decision trees, we propose an algorithm for building a sequential binary partition, optimal with respect to an information criterion, of the n-dimensional feature space of texts into pairs of disjoint n-dimensional intervals. These intervals make up the style sheet that defines the «style portrait» (profile sheet) of a set of texts. The feature space is assumed to be a frequency space, formed by the frequencies with which function words, phrases, bigrams, etc. occur in the text set. The algorithm is implemented in the software system «StyleAnalizator», which is intended for the comprehensive study of texts of various types. Using several text corpora, we compare the quality of classifying texts by author, genre, style, and other characteristics when decision-tree algorithms are used and when style sheets are used. The resulting algorithm for learning text style profiles can be used to determine which style a text by an unknown author belongs to. This makes it possible, in particular, to determine the most probable author of the text.
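The partitioning idea can be illustrated with a minimal sketch. It is not the «StyleAnalizator» implementation: the abstract does not name the information criterion, so Shannon information gain is assumed here, and the function names (`entropy`, `best_split`, `build_style_sheet`, `classify`) and the toy frequency vectors are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split(X, y):
    """Best axis-parallel binary split as (gain, feature_index, threshold), or None."""
    base, n = entropy(y), len(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between neighbouring distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def build_style_sheet(X, y, box=None):
    """Recursively split the feature space; return a list of (box, label) rows.

    A box is a list of (low, high) bounds, one pair per feature, read as low < value <= high.
    """
    if box is None:
        box = [(-math.inf, math.inf)] * len(X[0])
    split = best_split(X, y) if len(set(y)) > 1 else None
    if split is None or split[0] <= 0:
        # Leaf: label the interval with the majority class of its texts.
        return [(box, Counter(y).most_common(1)[0][0])]
    _, j, t = split
    left_box, right_box = list(box), list(box)
    left_box[j] = (box[j][0], t)
    right_box[j] = (t, box[j][1])
    left = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    return (build_style_sheet([r for r, _ in left], [l for _, l in left], left_box)
            + build_style_sheet([r for r, _ in right], [l for _, l in right], right_box))

def classify(sheet, x):
    """Return the label of the interval that contains feature vector x."""
    for box, label in sheet:
        if all(lo < v <= hi for v, (lo, hi) in zip(x, box)):
            return label
    return None

# Toy data (made up): relative frequencies of two function words in four texts.
X = [[0.10, 0.02], [0.12, 0.03], [0.30, 0.02], [0.32, 0.04]]
y = ["A", "A", "B", "B"]
sheet = build_style_sheet(X, y)
```

Each row of `sheet` pairs one n-dimensional interval with a class label, so a text by an unknown author can be attributed by looking up the interval that contains its frequency vector, for example `classify(sheet, [0.11, 0.50])`.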
Keywords
text classification, decision trees, style sheet, style profile, text identification
Authors
Name | Organization | Email |
Kubarev Anton I. | National Research Tomsk State University | kubarev_ai@mail.ru |
Kukushkina Olga V. | M.V. Lomonosov Moscow State University | kukush@orc.ru |
Poddubny Vasiliy V. | National Research Tomsk State University | vvpoddubny@gmail.com |
Shevelyov Oleg G. | National Research Tomsk State University | oshevelyov@gmail.com |