The Incremental Prediction of the Morphological Paradigm of Unknown Words in the Russian Language
This article describes a new method for predicting the morphological paradigm of unknown words (words absent from the dictionary) in the Russian language. Modern morphological analyzers detect the morphological paradigm of a word using a dictionary of word forms, in which each word form has corresponding morphological features. The dictionary-based approach is faster and more precise than the alternatives, but it has one essential shortcoming: it is limited to the available dictionary and cannot detect the morphological paradigm of words that the dictionary does not contain. The method described in the article predicts the morphological paradigm of a word incrementally. It is based on ensemble prediction of the morphological paradigm from a single word form, followed by the step-by-step formation of partial paradigms from several word forms. The partial paradigms are then used to compute the final prediction. At the first step, ensemble prediction polls several different prediction strategies and forms an intermediate prediction result. At the second step, the method correlates the collected word forms and builds partial paradigms. The partial paradigms that are filled with word forms up to a certain threshold are then used to form the final prediction result. At the third step, error correction and a new prediction are performed for words whose morphological paradigm could not be detected. An important advantage of the described method is that it works in an incremental, unsupervised mode, without human intervention. The system is self-learning: the more word forms it encounters, the quicker and more precise the prediction becomes. In addition, the prediction algorithms have practically no effect on the performance of the overall system. No cumbersome preprocessing or advance construction of morphological dictionaries for new word forms is required. To confirm the applicability of the new method, a study was conducted.
The study used a multi-genre text corpus of the Russian language (approx. 10 billion words) and was carried out in two steps. At the first step, part of the corpus (about 1 billion words) was analyzed. This analysis took into account the frequency distribution of the predicted word forms. Only nouns, adjectives, verbs and adverbs were analyzed. At the second step, the full text corpus was analyzed without regard to the frequency distribution. The study confirmed the high precision and performance of the described prediction method. The method is able to predict the morphological paradigms of two thirds of all word forms encountered in the text.
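The first two steps of the method (ensemble prediction from a single word form, then accumulation of partial paradigms up to a threshold) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy paradigm table, the transliterated endings, the two strategies `by_ending` and `by_longest_ending`, the class `IncrementalPredictor`, and the 0.6 threshold are all assumptions made for illustration, and the third step (error correction) is not covered.

```python
from collections import Counter, defaultdict

# Toy paradigm table (an assumption, not from the article):
# paradigm name -> set of transliterated endings.
PARADIGMS = {
    "fem_a": {"a", "y", "e", "u", "oj"},
    "masc_zero": {"", "a", "u", "om", "e"},
}


def by_ending(form):
    """Hypothetical strategy 1: every (stem, paradigm) split whose ending fits."""
    out = []
    for name, endings in PARADIGMS.items():
        for end in endings:
            if form.endswith(end) and len(form) > len(end):
                out.append((form[:len(form) - len(end)], name))
    return out


def by_longest_ending(form):
    """Hypothetical strategy 2: keep only splits with the longest ending."""
    cands = by_ending(form)
    if not cands:
        return []
    shortest_stem = min(len(stem) for stem, _ in cands)
    return [c for c in cands if len(c[0]) == shortest_stem]


def ensemble_predict(form, strategies=(by_ending, by_longest_ending)):
    """Step 1: poll all strategies and count votes per (stem, paradigm)."""
    votes = Counter()
    for strategy in strategies:
        votes.update(set(strategy(form)))
    return votes


class IncrementalPredictor:
    """Step 2: fill partial paradigms as word forms arrive, one at a time."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold        # required fill ratio (assumed value)
        self.partial = defaultdict(set)   # (stem, paradigm) -> endings seen
        self.votes = Counter()            # (stem, paradigm) -> ensemble votes

    def observe(self, form):
        for (stem, name), n in ensemble_predict(form).items():
            self.partial[(stem, name)].add(form[len(stem):])
            self.votes[(stem, name)] += n

    def predict(self, stem):
        """Return (paradigm, fill ratio) once a partial paradigm is full enough."""
        best = None
        for (s, name), seen in self.partial.items():
            if s != stem:
                continue
            fill = len(seen) / len(PARADIGMS[name])
            key = (fill, self.votes[(s, name)])
            if fill >= self.threshold and (best is None or key > best[0]):
                best = (key, name)
        return None if best is None else (best[1], best[0][0])


predictor = IncrementalPredictor()
for word_form in ["lampa", "lampy", "lampu", "lampoj"]:
    predictor.observe(word_form)
print(predictor.predict("lamp"))
```

In this sketch a single form such as "lampa" is compatible with both toy paradigms, so no prediction is made. Only after several forms of the same stem have been observed does one partial paradigm pass the threshold, which mirrors the incremental behaviour described in the abstract.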
Keywords
morphology, morphological analysis, morphological paradigm, prediction of morphological paradigm of unknown words, computational linguistics, corpus research
Authors
Name | Organization | E-mail |
Lyukina Elena V. | Higher School of Economics | eliouki@mail.ru |
Lytaeva Maria A. | Higher School of Economics | lytaeva_ma@mail.ru |
The Incremental Prediction of the Morphological Paradigm of Unknown Words in the Russian Language | Vestnik Tomskogo gosudarstvennogo universiteta. Filologiya – Tomsk State University Journal of Philology. 2020. № 68. DOI: 10.17223/19986645/68/2