The Incremental Prediction of the Morphological Paradigm of Unknown Words in the Russian Language | Vestnik Tomskogo gosudarstvennogo universiteta. Filologiya – Tomsk State University Journal of Philology. 2020. № 68. DOI: 10.17223/19986645/68/2

The Incremental Prediction of the Morphological Paradigm of Unknown Words in the Russian Language

This article describes a new method for predicting the morphological paradigm of unknown Russian words, that is, words absent from the dictionary. Modern morphological analyzers determine the paradigm of a word from a dictionary of word forms in which each form is annotated with its morphological features. This approach is faster and more precise than the alternatives, but it has one essential shortcoming: it is limited to the available dictionary and cannot determine the paradigm of a word that the dictionary does not contain.

The method described here predicts the morphological paradigm of a word incrementally. It combines ensemble prediction of the paradigm from a single word form with the consecutive construction of partial paradigms from several word forms; the partial paradigms are then used to compute the final prediction. In the first step, the ensemble polls several prediction strategies and forms an intermediate prediction. In the second step, the method correlates the collected word forms and builds partial paradigms; those filled with word forms up to a certain threshold are used to form the final prediction. In the third step, error correction and a new prediction are performed for words whose paradigm cannot be determined.

An important advantage of the method is that it works in an incremental, unsupervised mode, without human intervention. The system is self-learning: the more word forms it encounters, the faster and more precise its predictions become. The prediction algorithms also have practically no effect on the performance of the overall system, and no costly preprocessing or advance construction of morphological dictionaries for new word forms is required.

To confirm the applicability of the new method, a study was conducted on a Russian text corpus of various genres (approx. 10 billion words). The study had two steps. In the first, a part of the corpus (about 1 billion words) was analyzed, taking into account the frequency distribution of the predicted word forms; only nouns, adjectives, verbs and adverbs were analyzed. In the second, the full corpus was analyzed without the frequency distribution. The study confirmed the high precision and performance of the described prediction method, which is able to predict the morphological paradigms of two thirds of all word forms encountered in the text.
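The first two steps of the method (ensemble polling on a single word form, then accumulation of partial paradigms up to a fill threshold) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the two toy ending sets (transliterated and heavily simplified), the two voting strategies, the threshold value of 0.6, and all names such as `IncrementalPredictor`. The third step, error correction, is not covered.

```python
from collections import Counter, defaultdict

# Toy ending sets invented for this sketch (transliterated, heavily simplified).
PARADIGMS = {
    "fem_a": {"a", "y", "e", "u", "oj"},
    "masc_zero": {"", "a", "u", "om", "e"},
}
FILL_THRESHOLD = 0.6  # assumed value; the abstract does not state the threshold


def splits(form):
    """Yield every (stem, paradigm) reading that the ending sets allow."""
    for name, endings in PARADIGMS.items():
        for ending in endings:
            if form.endswith(ending) and len(form) > len(ending):
                yield form[: len(form) - len(ending)], name


def strategy_all(form):
    """Hypothetical strategy 1: vote for every possible reading."""
    return list(splits(form))


def strategy_longest_ending(form):
    """Hypothetical strategy 2: vote only for readings with the longest ending."""
    readings = list(splits(form))
    if not readings:
        return []
    shortest = min(len(stem) for stem, _ in readings)
    return [(stem, name) for stem, name in readings if len(stem) == shortest]


STRATEGIES = (strategy_all, strategy_longest_ending)


class IncrementalPredictor:
    """Collects word forms one at a time and fills partial paradigms."""

    def __init__(self):
        self.partial = defaultdict(set)  # (stem, paradigm) -> endings seen so far
        self.votes = Counter()           # (stem, paradigm) -> ensemble votes

    def observe(self, form):
        """Step 1: poll the strategies. Step 2: extend the partial paradigms."""
        for strategy in STRATEGIES:
            for stem, name in strategy(form):
                self.votes[(stem, name)] += 1
                self.partial[(stem, name)].add(form[len(stem):])

    def predict(self, stem):
        """Return the best paradigm filled up to the threshold, else None."""
        best = None
        for (s, name), seen in self.partial.items():
            if s != stem:
                continue
            fill = len(seen) / len(PARADIGMS[name])
            if fill >= FILL_THRESHOLD:
                key = (fill, self.votes[(s, name)])
                if best is None or key > best[0]:
                    best = (key, name)
        return best[1] if best else None


predictor = IncrementalPredictor()
predictor.observe("lampa")
print(predictor.predict("lamp"))  # None: one form fills too little of any paradigm
for form in ("lampy", "lampu", "lampoj"):
    predictor.observe(form)
print(predictor.predict("lamp"))  # fem_a: four of its five endings have been seen
```

In this sketch a single form such as "lampa" is compatible with both toy paradigms, so no prediction is made; only after several forms of the same stem have been met does one partial paradigm pass the threshold. This mirrors the self-learning behaviour described in the abstract, under the stated assumptions.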


Keywords

morphology, morphological analysis, morphological paradigm, prediction of morphological paradigm of unknown words, computational linguistics, corpus research

Authors

Name | Organization | E-mail
Lyukina Elena V. | Higher School of Economics | eliouki@mail.ru
Lytaeva Maria A. | Higher School of Economics | lytaeva_ma@mail.ru
