An approach to recognizing named entities, using technological terms as an example, on a limited training sample
The paper addresses the problem of named entity recognition, using technological terms as an example; a named entity is a word or phrase denoting an object or phenomenon of a certain category. Automatic recognition of technological terms allows companies to optimize business processes. Recognizing named entities on a limited training sample is a non-trivial task. The current standard approaches to named entity recognition are conditional random fields (CRF) and bidirectional long short-term memory networks (Bi-LSTM). The paper proposes an approach that combines a statistical model (CRF) with a neural network model (Bi-LSTM-CRF). The main advantage of the CRF stage is that, at the cost of only a slight increase in training time, it provides additional information to the subsequent Bi-LSTM-CRF model, allowing it to learn more effectively from a limited sample. Two techniques are used to map text into a feature space: extracting syntactic properties of words for the statistical model, and converting text into vectors with the SciBERT language model. The work demonstrates a significant improvement in the quality of technological term recognition, achieved by combining statistical and neural network machine learning models and by using a domain-oriented language model for the vector representation of scientific texts. This improved recognition quality, measured by the F1-score metric, by 12% when training on 800 texts compared with the traditional approach.
Keywords
technology term recognition,
named entity recognition,
model combination,
Bi-LSTM (bidirectional long short-term memory),
CRF (conditional random field)
Authors
Kulnevich Alexey Dmitrievich | National Research Tomsk State University | kulnevich94@mail.ru |
Koshechkin Alexander Alekseevich | National Research Tomsk State University | kaa1994g@mail.ru |
Karev Svyatoslav Vasilyevich | National Research Tomsk State University | svyatoslav.karev@live.ru |
Zamyatin Alexander Vladimirovich | National Research Tomsk State University | avzamyatin@inbox.ru |
References
Nadeau D., Sekine S. A survey of named entity recognition and classification // Lingvisticae Investigationes. 2007. V. 30, № 1. P. 3-26.
Marrero M. et al. Named entity recognition: fallacies, challenges and opportunities // Computer Standards & Interfaces. 2013. V. 35, № 5. P. 482-489.
Korkontzelos I. et al. Boosting drug named entity recognition using an aggregate classifier // Artificial Intelligence in Medicine. 2015. V. 65, № 2. P. 145-153.
Lafferty J., McCallum A., Pereira F.C.N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001. URL: https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers
Schuster M., Paliwal K.K. Bidirectional recurrent neural networks // IEEE Transactions on Signal Processing. 1997. V. 45, № 11. P. 2673-2681.
Li J., Sun A., Han J., Li C. A Survey on Deep Learning for Named Entity Recognition // IEEE Transactions on Knowledge and Data Engineering. 2020. DOI: 10.1109/TKDE.2020.2981314
Hossari M., Dev S., Kelleher J.D. TEST: A Terminology Extraction System for Technology Related Terms // ICCAE 2019, February 23-25, 2019. URL: https://arxiv.org/pdf/1812.09541.pdf
Chiu J.P.C., Nichols E. Named Entity Recognition with Bidirectional LSTM-CNNs // arXiv preprint arXiv:1511.08308. 2016. URL: https://arxiv.org/pdf/1511.08308.pdf
Pennington J., Socher R., Manning C.D. GloVe: Global Vectors for Word Representation // Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. P. 1532-1543. URL: https://nlp.stanford.edu/pubs/glove.pdf
Wang S., Zhou W., Jiang C. A survey of word embeddings based on deep learning // Computing. 2020. V. 102, № 3. P. 717-740.
Wang Y. et al. From static to dynamic word representations: a survey // International Journal of Machine Learning and Cybernetics. 2020. V. 11, № 4. P. 1-20.
Devlin J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding // arXiv preprint arXiv:1810.04805. 2018. URL: https://arxiv.org/pdf/1810.04805.pdf
Beltagy I., Lo K., Cohan A. SciBERT: a pretrained language model for scientific text // Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. DOI: 10.18653/v1/D19-1371
Tenney I. et al. What do you learn from context? Probing for sentence structure in contextualized word representations // arXiv preprint arXiv:1905.06316. 2019. URL: https://arxiv.org/pdf/1905.06316.pdf
Tenney I., Das D., Pavlick E. BERT rediscovers the classical NLP pipeline // Proc. of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. P. 4593-4601.
Huang Z., Xu W., Yu K. Bidirectional LSTM-CRF models for sequence tagging // arXiv preprint arXiv:1508.01991. 2015. URL: https://arxiv.org/pdf/1508.01991.pdf
Vaswani A. et al. Attention is all you need // arXiv preprint arXiv:1706.03762. 2017. URL: https://arxiv.org/pdf/1706.03762.pdf
arXiv.org: a service for the free distribution of articles in physics, mathematics, computer science, and other fields. URL: https://arxiv.org/ (accessed: 22.10.2020).
Stenetorp P. et al. BRAT: a web-based tool for NLP-assisted text annotation // Proc. of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. 2012. P. 102-107.