Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2020. № 50. DOI: 10.17223/19988605/50/2

Using the analysis of semantic proximity of words in solving the problem of determining the genre of texts within deep learning

Classifying texts by topic and genre is a key task in the processing of text corpora. This work is usually done manually, which makes processing large corpora extremely slow. Moreover, an unambiguous classification is not always possible: in most cases the same text can be attributed to several topics and genres, only one of which is the principal one. Fully automating the classification, or at least narrowing the researcher's choice to a list of the most likely topics and genres, is therefore of practical interest. To solve this problem, the authors propose using convolutional neural networks, which are effective classifiers but remain comparatively little used and studied for text recognition. To present the data in a form suitable for a convolutional neural network, the word2vec model was chosen; it builds vector representations of words that reflect their semantic proximity. The word2vec model was implemented with the Skip-gram architecture, which trains slowly but handles rare words well. The model hyperparameters were selected on the basis of numerous experiments. The output of the trained model is the probability that a work belongs to each class. Analysis of the results suggests that the proposed convolutional neural network model is sound and reflects the literary perception of genre fairly accurately.
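The pipeline described in the abstract (word vectors, convolution over the text, per-class probabilities) can be illustrated with a minimal pure-Python sketch. Everything concrete in it is an assumption made for illustration: the three-dimensional toy vectors stand in for trained Skip-gram embeddings, the two genre labels and the hand-set filters are hypothetical, and the pooled features are fed straight into a softmax, whereas a real network would normally learn its filters and add a dense layer. It is not the authors' architecture or their hyperparameters.

```python
import math

# Toy 3-dimensional vectors standing in for trained word2vec (Skip-gram)
# embeddings. Assumption: real vectors would be learned from a corpus.
EMBEDDINGS = {
    "dragon": [0.9, 0.1, 0.0],
    "sword": [0.8, 0.2, 0.1],
    "detective": [0.1, 0.9, 0.0],
    "murder": [0.0, 0.8, 0.2],
    "the": [0.1, 0.1, 0.1],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary words

# Hypothetical genre labels with one hand-set filter per genre.
# Each filter spans two consecutive word vectors.
GENRES = ["fantasy", "detective"]
KERNELS = [
    [[1.0, -1.0, 0.0], [1.0, -1.0, 0.0]],
    [[-1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]],
]


def cosine(u, v):
    """Cosine similarity, a common measure of semantic proximity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(tokens):
    """Turn a token list into a matrix with one word vector per row."""
    return [EMBEDDINGS.get(token, UNKNOWN) for token in tokens]


def conv_max_pool(matrix, kernel):
    """Slide the kernel over consecutive rows, apply ReLU, keep the maximum."""
    width = len(kernel)
    activations = []
    for start in range(len(matrix) - width + 1):
        window = matrix[start:start + width]
        total = sum(
            weight * value
            for kernel_row, vector in zip(kernel, window)
            for weight, value in zip(kernel_row, vector)
        )
        activations.append(max(0.0, total))
    # Texts shorter than the kernel produce no windows; treat that as 0.
    return max(activations, default=0.0)


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def genre_probabilities(tokens):
    """Return a probability for each genre label (simplified forward pass)."""
    features = [conv_max_pool(embed(tokens), kernel) for kernel in KERNELS]
    return dict(zip(GENRES, softmax(features)))


if __name__ == "__main__":
    print(cosine(EMBEDDINGS["dragon"], EMBEDDINGS["sword"]))
    print(genre_probabilities(["the", "dragon", "sword"]))
```

In this toy setup, words from the same genre have a higher cosine similarity than words from different genres, and the token sequence "the dragon sword" receives a higher probability for the hypothetical "fantasy" label than for "detective". Returning a full probability distribution rather than a single label matches the abstract's point that a text may belong to several genres, with one of them being principal.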


Keywords

natural language text processing, text mining, word2vec model, convolutional neural networks, machine learning

Authors

Name	Organization	E-mail
Batraeva Inna A.	Saratov State University	BatraevaIA@info.sgu.ru
Nartsev Andrey D.	Saratov State University	narcev.andrey@gmail.com
Lezgyan Artem S.	Saratov State University	lezgyan@yandex.ru

