Topic modeling of prose fiction: Model assessment and interpretability (the case of Russian short stories of the 1900s–1930s)
The article presents two experiments investigating the interpretability of results obtained from automatic topic modeling of literary texts, addressing the broader question of whether this method is appropriate for fiction. The relevance of the research is grounded in the successful application of topic modeling to specialized texts, contrasted with the challenges posed by the metaphorical language and thematic complexity of literary works. The study aims to determine how and to what extent the topic distributions produced by the model (story-topic correlations) reflect the thematic aspects of short stories. The research material consisted of 3,000 short stories written by 927 Russian authors, including world-renowned figures such as Nobel laureates I.A. Bunin and M.A. Sholokhov, Russian “classics” like A.P. Chekhov and M. Gorky, as well as lesser-known and nearly forgotten writers. During the study, several samples were generated for each of three chronological periods: (1) the beginning of the 20th century (1900–1913), (2) the era of wars and revolutions (1914–1922), and (3) the early Soviet period. Each period was represented by three samples consisting of 100, 500, and 1,000 short stories. For each sample, models were constructed using the LDA algorithm with various preprocessing options. The evaluation of topic interpretability was conducted in two stages. The first stage aimed to identify which preprocessing steps yielded the best interpretability of the resulting model. The model trained on the corpus without any POS filtering exhibited the highest interpretability, with 25% of the generated topics deemed interpretable by experts. In the second stage, only the 24 topics that all three experts unanimously considered interpretable were analyzed further.
During the second experiment, two experts read all 127 texts from the resulting sample and evaluated each topic on a three-point scale: (1) fully corresponds, (2) partially corresponds, (3) does not correspond at all. The experts identified a good correspondence between the text and the automatically assigned topic in 52% of the short stories and a partial correspondence in 24%, while the remaining 24% of the stories appeared unrelated to the assigned topic. Thus, in 76% of the stories (52% + 24%), the assigned interpretable topic demonstrated a meaningful connection to the content of the story, beyond being merely a statistically significant word cluster within the text. These results are promising and suggest that topic modeling can be effectively applied to fiction, allowing researchers to identify typical themes within a collection of short stories without needing to read them all. However, achieving these results requires a careful preliminary selection of the topics generated by the model to ensure their semantic coherence, as was done in the first stage of the experiment.
The authors express their gratitude to the members of the Scientific and Educational Group of Interdisciplinary Philological Research of the National Research University Higher School of Economics in St. Petersburg, especially to A.S. Karysheva, E.O. Kolpashchikova, A.Yu. Moskalenko, I.A. Delazari, and I.S. Zavyalova, for their active participation and expert assessment at different stages of the research.
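The two-stage evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, topic labels, and toy data below are hypothetical, and only the logic follows the abstract (keep the topics that every expert marked as interpretable, then compute the share of each rating on the three-point scale).

```python
from collections import Counter

def unanimous_topics(votes):
    """Return the topics that every expert marked as interpretable.

    `votes` maps a topic label to a tuple of booleans, one per expert.
    """
    return [topic for topic, marks in votes.items() if all(marks)]

def rating_shares(ratings):
    """Return the percentage share of each rating on the three-point scale.

    1 = fully corresponds, 2 = partially corresponds, 3 = does not correspond.
    """
    counts = Counter(ratings)
    total = len(ratings)
    return {label: 100 * counts[label] / total for label in (1, 2, 3)}

# Hypothetical toy data (not taken from the study).
votes = {
    "topic_a": (True, True, True),
    "topic_b": (True, False, True),
    "topic_c": (True, True, True),
}
kept = unanimous_topics(votes)

ratings = [1, 1, 2, 3, 1, 2, 1, 3]
shares = rating_shares(ratings)

# Share of stories with full or partial correspondence to the assigned topic.
related = shares[1] + shares[2]
print(kept, shares, related)
```

With the figures reported in the abstract, the same sum is 52% + 24% = 76%.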
Keywords
Russian literature,
Russian short story,
short fiction,
topic modeling,
literary theme,
corpus linguistics,
computational linguistics,
interpretability,
expert assessment
Authors
Sherstinova Tatiana Yu. | National Research University Higher School of Economics | tsherstinova@hse.ru |
Kirina Margarita A. | National Research University Higher School of Economics | mkirina@hse.ru |
Moskvina Anna D. | National Research University Higher School of Economics | admoskvina@hse.ru |