Building a style sheet using text classification algorithms based on decision trees | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2012. № 56.

Building a style sheet using text classification algorithms based on decision trees

Using classification algorithms based on decision trees, we propose an algorithm that builds a sequential binary partition, optimal with respect to an information criterion, of the n-dimensional feature space of texts into disjoint n-dimensional intervals. These intervals make up the style sheet that defines the «style portrait» (profile sheet) of a set of texts. The feature space is assumed to be frequency-based: it is formed by the frequencies with which function words, phrases, bigrams, etc. occur in the set of texts. The algorithm is implemented in the «StyleAnalizator» software system, which is intended for the comprehensive study of texts of various types. Using several text corpora, we compare the quality of classifying texts by author, genre, style, and other characteristics when using decision-tree algorithms and when using style sheets. The resulting algorithm for training style profiles can be used to determine the style to which a text by an unknown author belongs. In particular, this makes it possible to determine the most probable authorship of the text.
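The abstract does not give the algorithm's details, so the following is only a minimal sketch of the two ingredients it names: a frequency-based feature vector and a single binary split chosen by an information criterion. Here the criterion is assumed to be Shannon-entropy information gain, as in standard decision trees. The function-word list and the function names (frequency_vector, entropy, best_binary_split) are hypothetical and are not taken from «StyleAnalizator».

```python
import math
from collections import Counter

# Hypothetical function-word list; the paper's actual feature set is not given.
FUNCTION_WORDS = ["the", "of", "and", "in", "to"]

def frequency_vector(text, words=FUNCTION_WORDS):
    """Relative frequency of each chosen word among all tokens of the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[w] / total for w in words]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(vectors, labels):
    """Return (gain, feature index, threshold) with the largest information gain.

    Assumption: the information criterion is entropy-based information gain.
    """
    base = entropy(labels)
    best = (0.0, None, None)
    for j in range(len(vectors[0])):
        values = sorted(set(v[j] for v in vectors))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for v, y in zip(vectors, labels) if v[j] <= t]
            right = [y for v, y in zip(vectors, labels) if v[j] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if gain > best[0]:
                best = (gain, j, t)
    return best
```

Applying best_binary_split recursively to the texts on each side of the chosen threshold would yield axis-aligned, disjoint intervals of the feature space, which is one way to read the partition described above.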


Keywords

text classification, decision trees, style sheets, style profile, text identification

Authors

Name | Organization | E-mail
Kubarev Anton I. | National Research Tomsk State University | kubarev_ai@mail.ru
Kukushkina Olga V. | M.V. Lomonosov Moscow State University | kukush@orc.ru
Poddubny Vasiliy V. | National Research Tomsk State University | vvpoddubny@gmail.com
Shevelyov Oleg G. | National Research Tomsk State University | oshevelyov@gmail.com
Total: 4
