On the Corpus of Dialectal Texts in the Russian National Corpus
The paper describes the present state of the Corpus of Dialectal Texts within the Russian National Corpus (RNC). The Dialectal Corpus is available online at http://www.ruscorpora.ru/search-dialect.html and has been searchable since December 2006. Over time the markup team has changed, and so have the principles of the markup. The new team has developed a new standard for formatting dialectal texts. According to this standard, texts should be included in the corpus in a phonetic representation with marked stress, and the user should be able to work both with fragments and with whole texts. The paper describes the main principles of the grammatical, semantic and metatext markup of the dialectal texts, as well as guidelines for online search. The metatext markup consists of three levels: 1) provenance of a text; 2) phonetic markup; 3) genre and topic. A subcorpus can be customized based on any combination of these sets of parameters. It is also possible to select texts for which an orthographic rendering and/or an audio recording is available. All texts in the Dialectal Subcorpus have their morphological ambiguity resolved and carry full morphological markup, including dialectal characteristics. As in the main body of the RNC, search operates in two modes: by exact (sub)string and by lemmata and grammatical features (grams). The semantic annotation is two-tiered: it is represented both in the metatext tagging and as part of the word-by-word annotation. The topic (aboutness) of a text is determined at the metatext level. An individual lexeme can also be tagged semantically. A word is "translated" into standard Russian only if the text is accompanied by a dictionary or notes supplied by the transcriber. It is also possible to add derivatives in order to find other dialectal words with the same root. In 2016 the Dialectal Subcorpus was updated and now contains 300 thousand items.
The Dialectal Subcorpus of the RNC is intended to include every kind of dialectal text available in Russian: from the historical Russian area (Central European Russia), from the area of early colonization (North European Russia) and the area of late colonization (Siberia, the Far East, the Don, the South Volga), as well as from the Russian diaspora, mostly Old Believers and Protestants (Latgale, Azerbaijan, Romania, Australia, Canada, the United States and others). The texts are provided by dialectologists who do fieldwork; they can contribute transcripts from their personal notes or audio recordings, as well as published texts. The authors hope that the Dialectal Corpus will soon become a representative collection and will be widely used.
Keywords
Russian dialectology, corpus linguistics, lexical semantics, electronic resources, morphological marking
Authors
Name | Organization | Email |
Kachinskaya Irina B. | Lomonosov Moscow State University | kacza@yandex.ru |
Sichinava Dmitry V. | Vinogradov Institute of the Russian Language of the Russian Academy of Sciences | mitrius@gmail.com |
On the Corpus of Dialectal Texts in the Russian National Corpus | Voprosy leksikografii – Russian Journal of Lexicography. 2017. № 11. DOI: 10.17223/22274200/11/5