The Oral Speech Corpus of Russian-Turkic Bilinguals of Southern Siberia: The Marking of Deviations from the Speech Standard
The article describes approaches to marking deviations from the speech standard (error annotation) in the corpus of oral speech of Russian-Turkic bilinguals of Southern Siberia. The corpus texts are recordings of spontaneous oral speech of Russian-Turkic bilinguals; the corpus is therefore bimodal, with the audio synchronized with the transcription in ELAN. Morphological annotation is performed automatically with Mystem, a console program developed by Yandex, and is then corrected manually. Alongside the traditional morphological annotation, the corpus contains annotation of so-called “errors” (error annotation). Two types of “error” labeling are applied: the first according to the levels of the language system, the second according to the sources of the deviations. The article describes the tags and provides text fragments annotated with them. The key corpus tags are: Phon - phonetics, Synt - syntax, Morph - morphology, Lex - lexis, Der - derivation, Disk - discourse, Sem - semantics, Acc - accent, Infl - inflexion, Aff - affix, Decl - declension, Agr - agreement, Gov - government, Id - idiom, Prep - preposition, Gen - gender, Num - number, Con - construction, Red - reduction, etc. The annotation of deviations from the speech standard also includes tags for the source of the deviation. This type of marking follows from the purpose of the corpus: to record the speech practices of bilinguals and to identify the factors that determine the nature and degree of interference at all language levels. Both translingual and intralingual influence are annotated in the corpus. The influence of the norms of the regional variant of Russian is marked as intralingual. Regionally determined deviations from the speech standard are marked with the regional tag [Reg]. This tag covers all manifestations of regional variants: Siberian dialects, the Siberian variant of urban vernacular, and regional variants of the literary language. Deviations from the speech standard that reflect the influence of features of the bilingual’s mother tongue are marked with the interference tag [Int]. Features of oral colloquial speech as such are not marked with additional tags in the corpus. The article provides examples of annotation using the two types of tags. Together, the morphological annotation and the marking of deviations from the speech standard support a wide range of queries in the search engine of the corpus. Linking them to the corpus metadata annotation system extends the search capabilities by allowing types of deviations to be matched with types of bilingualism and sociocultural types of speakers.
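The two-layer tagging described above (language-level tags plus a source tag) can be pictured with a minimal Python sketch. It is an illustration only, not the corpus's actual storage format or search engine: the `Deviation` record, its field names, the `find_deviations` helper, and the idea that one deviation may carry several level tags are all assumptions, and the sample fragments are invented placeholders rather than corpus data. Only the tag names and their glosses come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Level tags listed in the abstract (the list there ends with "etc.",
# so this inventory is not exhaustive).
LEVEL_TAGS = {
    "Phon": "phonetics", "Synt": "syntax", "Morph": "morphology",
    "Lex": "lexis", "Der": "derivation", "Disk": "discourse",
    "Sem": "semantics", "Acc": "accent", "Infl": "inflexion",
    "Aff": "affix", "Decl": "declension", "Agr": "agreement",
    "Gov": "government", "Id": "idiom", "Prep": "preposition",
    "Gen": "gender", "Num": "number", "Con": "construction",
    "Red": "reduction",
}

# Source tags named in the abstract.
SOURCE_TAGS = {
    "Reg": "regional variant of Russian (intralingual influence)",
    "Int": "interference from the mother tongue (translingual influence)",
}


@dataclass
class Deviation:
    """One annotated deviation. The record layout is hypothetical."""
    fragment: str                      # transcribed text span
    level_tags: Tuple[str, ...]        # e.g. ("Morph", "Agr")
    source_tag: Optional[str] = None   # "Reg", "Int", or None if unmarked

    def __post_init__(self) -> None:
        # Reject tags that are not in the inventories above.
        unknown = [t for t in self.level_tags if t not in LEVEL_TAGS]
        if unknown:
            raise ValueError(f"unknown level tag(s): {unknown}")
        if self.source_tag is not None and self.source_tag not in SOURCE_TAGS:
            raise ValueError(f"unknown source tag: {self.source_tag}")


def find_deviations(items: List[Deviation],
                    level: Optional[str] = None,
                    source: Optional[str] = None) -> List[Deviation]:
    """Return deviations matching the given level and/or source tag."""
    return [
        d for d in items
        if (level is None or level in d.level_tags)
        and (source is None or d.source_tag == source)
    ]


# Invented placeholder records, not taken from the corpus.
sample = [
    Deviation("placeholder fragment 1", ("Morph", "Agr"), "Int"),
    Deviation("placeholder fragment 2", ("Phon",), "Reg"),
    Deviation("placeholder fragment 3", ("Lex",)),
]

agreement_from_interference = find_deviations(sample, level="Agr", source="Int")
```

Keeping the level tags and the source tag in separate fields mirrors the abstract's distinction between the two types of labeling, so a query can filter on either one independently or on both at once.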
Keywords
bilingualism, corpus of texts, annotation, deviations from the speech standard, interference, Turkic languages, Russian, Siberian dialects
Authors
Name | Organization | Email |
Rezanova Zoya I. | Tomsk State University | rezanovazi@mail.ru |
The Oral Speech Corpus of Russian-Turkic Bilinguals of Southern Siberia: The Marking of Deviations from the Speech Standard | Voprosy leksikografii – Russian Journal of Lexicography. 2019. № 15. DOI: 10.17223/22274200/15/8