Comparison of stream-based fiction text classification methods based on data compression and counting substrings
We consider the problem of comparing the quality of the natural language text classification based on the stream methods using R-measure and PPM compression. The task of text classification requires a certain amount of texts for each class (genre, author, etc.) in the classifier training. The process of classifier training is reduced to a concatenation of all texts in supertext by each class. In stream-based text classification X is a sequence (stream) of n elements (characters, words, phrases, etc.) [a]. A matching of the test text to any class, in cases of R-measure, is performed through maximum measure of closeness to supertext of a class. In the PPM method, the matching is determined by the smallest difference in lengths of a compressed concatenation of supertext and the test text by the length of the compressed supertext. The truncated R-measure [a] counts all the test text (length n) substrings with the lengths from к1 to к2 in the supertext (unlike the basic method with k 1 = 1 and k 2 = n): кг R(X | S) = r(X | S)/N, r (X | S) = J c k (X | S), N = (2(n +1) -(k 1 + k 2))(k 2 -k x +1)/2, к=k 1 c k (X | S) = Jc(...x„ | S), c((k+,...x n | S) = Ю' ' k+ ... " ^ ' i=k i-к+1... n ^ . where Х is the test text, S is the supertext of the examined class, к is the length of the search substring, N is the number of substrings. The PPM method is a context-dependent modeling with limited order, which allows to estimate the probability of a symbol appearance depending on the previous symbols [b]. The probability estimate using a context of length N is a context-limited model of order N (order-N, O-N). A good probability estimation of symbol appearance requires to consider contexts with different lengths. The calculated estimates are encoded by any entropy encoder, allowing to compress a text. Initially, a classification algorithm obtains the class super-text compressed by PPM method, then the concatenation of supertext, and finely the test text should be compressed in the same manner. Both classifiers were tested by two text specific samples: the Russian prose texts of the 19th century and the 90s of the 20th century. The first sample contains only 9 authors representing the classes, the second sample - 21 authors. Each sample contains about 100 texts that make a training sample. The supertexts formed by each class texts were normalized by volume. Test samples were obtained from texts that were out of training sample, except two texts for each class, and they were normalized by the number and volume (about 100 thousand characters for a test text). To test the stream classification quality software modules that allow classify texts by R-measure and PPM method have been designed and implemented in C#. The problem of the kj and k 2 parameter selection in R-measure has been solved by natural language features. The minimum length of substrings (k 1) is equal to 10 characters because a shorter substrings can match with a lot of words in Russian language, which are common for all authors, and that can reduce stylistic features detection of different authors. The maximum length of the substrings (k 2) is 45 characters that allows to process even a phrase with 3 or 4 words. Using a greater length of substring seems not so useful because such long phrase cannot be repeated many times by various authors. Note that the problem of plagiarism is completely excluded from our consideration. Classifier Module based on the PPM method uses the PPMD algorithm of the order 5. Classification quality for each class is estimated by F-measure, namely, the harmonic mean of precision and recall [a]. The mean of F-measure values by all classes is taken as the estimate of the stream-based classification quality in general. The classifier quality characteristics for each class are obtained for the test samples. The means of F-measure for each classifier are calculated and the conclusions about the classification quality are made. The factors that reduce the classification quality were described for both methods.
Keywords
классификация текстов, художественные произведения, R-мера, PPM метод, F-мера, text classification, fiction classification, R-measure, PPM method, F-measureAuthors
Name | Organization | |
Ashurov Mihail F. | Tomsk State University | therevenge.amf@gmail.com |
References

Comparison of stream-based fiction text classification methods based on data compression and counting substrings | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2014. № 4(29).