Approach to transforming training data for improving the title generation performance for scientific texts | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2022. № 59. DOI: 10.17223/19988605/59/11

Approach to transforming training data for improving the title generation performance for scientific texts

Due to the significant increase in the availability and volume of scientific resources, the analysis and systematization of scientific documents has become an important task in natural language processing. Scientific articles contain a large amount of significant and diverse information; their number is constantly growing, and keeping track of current publications takes a lot of time. The number of documents a reader has to view can be reduced, and their content summarized, with special tools for automatic text processing, including text classification, information extraction, and text summarization. Within the summarization of scientific documents, one particular problem is generating the title of a scientific paper. Given the large volume of scientific resources, the title is especially important: its accuracy affects the visibility of the paper in the scientific community and, therefore, the number of prospective readers. Moreover, recent studies have shown that the quality of a paper's title influences the number of citations. Despite this, authors often do not spend enough time on creating a good title, which leaves it uninformative and poorly reflecting the content of the article. To overcome this weakness, methods of automatic title generation for scientific texts can be developed and used. In this work, we propose an approach to improving the quality of title generation for scientific texts. The proposed approach filters the training data and generates new training examples. We consider the following steps: 1) determining recall-oriented ROUGE-1 scores between titles and source texts from the training set; these scores show how many words of the title come from the text, so we can judge the content correspondence between the title and the source text; 2) ranking the examples of the training sample by the recall-oriented scores; 3) filtering out examples whose scores are less than the threshold value k (k ∈ [0, 1)); 4) training a model for title generation on the filtered training sample; 5) enriching the filtered training sample back to the original size with pseudo examples generated by the trained model. Pseudo examples are generated only for the examples removed in the previous step. The approach was tested on two English corpora of scientific texts (SciTLDR and arXiv). We used scientific abstracts as the source for text summarization. We evaluated values of k in the range from 0.3 to 0.9 in increments of 0.1. In most cases, the results showed that using a training sample consisting of filtered and pseudo examples improves title generation performance in comparison with generation using the original training sample. In our experiments, the most preferable values of the threshold k were 0.7 and 0.8. Experiments were conducted with the BART-base model. The author declares no conflict of interest.
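The five-step procedure described in the abstract can be illustrated with a short sketch. The code below is not the published implementation; it is a minimal Python illustration assuming a training set of (abstract, title) pairs and a fine-tuned BART-base checkpoint (the path ./bart-title-filtered and the helper names are hypothetical), and it approximates recall-oriented ROUGE-1 by plain unigram recall without stemming.

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast


def unigram_recall(title: str, text: str) -> float:
    """Step 1 (approximation): share of title words that also occur in the
    source text, i.e. a recall-oriented ROUGE-1 score without stemming."""
    title_tokens = title.lower().split()
    text_tokens = set(text.lower().split())
    if not title_tokens:
        return 0.0
    matched = sum(1 for tok in title_tokens if tok in text_tokens)
    return matched / len(title_tokens)


def split_by_threshold(pairs, k):
    """Steps 2-3: rank (abstract, title) pairs by their recall-oriented
    score and split them at the threshold k, k in [0, 1)."""
    scored = [(unigram_recall(title, text), text, title) for text, title in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = [(text, title) for score, text, title in scored if score >= k]
    removed = [(text, title) for score, text, title in scored if score < k]
    return kept, removed


def generate_pseudo_examples(removed, checkpoint="./bart-title-filtered"):
    """Step 5: replace the titles of the removed examples with pseudo-titles
    generated by the model fine-tuned on the filtered sample (step 4)."""
    tokenizer = BartTokenizerFast.from_pretrained(checkpoint)  # hypothetical path
    model = BartForConditionalGeneration.from_pretrained(checkpoint)
    pseudo = []
    for abstract, _ in removed:
        inputs = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
        ids = model.generate(**inputs, num_beams=4, max_length=32)
        pseudo.append((abstract, tokenizer.decode(ids[0], skip_special_tokens=True)))
    return pseudo


# Final training sample: filtered originals plus pseudo examples,
# restoring the original sample size.
# kept, removed = split_by_threshold(train_pairs, k=0.7)
# final_train = kept + generate_pseudo_examples(removed)
```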


Keywords

natural language processing, automatic text summarization, analysis of scientific texts, title generation, BART

Authors

Name: Glazkova Anna V.
Organization: Tyumen State University
E-mail: a.v.glazkova@utmn.ru

References

El-Kassas W. S. et al. Automatic text summarization: a comprehensive survey // Expert Systems with Applications. 2021. V. 165. Art. 113679.
Nallapati R., Zhai F., Zhou B. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents // Thirty-First AAAI Conference on Artificial Intelligence. 2017. P. 2101-2110.
Chen J., Zhuge H. Extractive summarization of documents with images based on multi-modal RNN // Future Generation Computer Systems. 2019. V. 99. P. 186-196.
Song S., Huang H., Ruan T. Abstractive text summarization using LSTM-CNN based deep learning // Multimedia Tools and Applications. 2019. V. 78 (1). P. 857-875.
Hanunggul P.M., Suyanto S. The impact of local attention in LSTM for abstractive text summarization // International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). 2019. P. 54-57.
Allahyari M. et al. Text Summarization Techniques: a Brief Survey // International Journal of Advanced Computer Science and Applications (IJACSA). 2017. V. 8 (10). P. 397-405.
Lin C.Y. Rouge: A package for automatic evaluation of summaries // Text summarization branches out. Barcelona, 2004. P. 74-81.
Zhang T. et al. BERTScore: Evaluating Text Generation with BERT // International Conference on Learning Representations. 2020. URL: https://arxiv.org/pdf/1904.09675v1.pdf
Papineni K. et al. BLEU: a method for automatic evaluation of machine translation // Proc. of the 40th annual meeting of the Association for Computational Linguistics. 2002. P. 311-318.
Lewis M. et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension // Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. P. 7871-7880.
Zhang J. et al. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization // International Conference on Machine Learning. 2020. P. 11328-11339.
Raffel C. et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer // Journal of Machine Learning Research. 2020. V. 21. P. 1-67.
Liu Y., Lapata M. Text Summarization with Pretrained Encoders // Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. P. 3730-3740.
Shen S.Q. et al. Recent advances on neural headline generation // Journal of Computer Science and Technology. 2017. V. 32 (4). P. 768-784.
Zhang R. et al. Question headline generation for news articles // Proc. of the 27th ACM international conference on information and knowledge management. 2018. P. 617-626.
Gavrilov D., Kalaidin P., Malykh V. Self-attentive model for headline generation // European Conference on Information Retrieval. 2019. P. 87-93.
Bukhtiyarov A., Gusev I. Advances of Transformer-Based Models for News Headline Generation // Conference on Artificial Intelligence and Natural Language. 2020. P. 54-61.
Putra J.W.G., Khodra M.L. Automatic title generation in scientific articles for authorship assistance: a summarization approach // Journal of ICT Research and Applications. 2017. № 11 (3). P. 253-267.
Fox C.W., Burns C.S. The relationship between manuscript title structure and success: editorial decisions and citation performance for an ecological journal // Ecology and Evolution. 2015. № 5 (10). P. 1970-1980.
Matsumaru K., Takase S., Okazaki N. Improving Truthfulness of Headline Generation // Proc. of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. P. 1335-1346.
Harmon J.E., Gross A.G. The structure of scientific titles // Journal of Technical Writing and Communication. 2009. V. 39 (4). P. 455-465.
Soler V. Writing titles in science: An exploratory study // English for specific purposes. 2007. V. 26 (1). P. 90-102.
Suvorova S.A. Lexical determinacy of the titles of scientific articles // Scientific Notes of V.I. Vernadsky Crimean Federal University. Philological Sciences. 2011. V. 24, № 1-1. P. 163-166. (In Russian)
Filonenko T.A. Attractive titles in scientific speech // Izvestia of the Samara Scientific Center of the Russian Academy of Sciences. Social, Humanitarian, Medico-Biological Sciences. 2008. V. 10, № 6-2. P. 290-296. (In Russian)
Cachola I. et al. TLDR: Extreme Summarization of Scientific Documents // Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 2020. P. 4766-4777.
Devlin J. et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proc. of NAACL-HLT. 2019. P. 4171-4186.
Radford A. et al. Language models are unsupervised multitask learners // OpenAI blog. 2019. V. 1 (8). P. 9.
Paszke A. et al. Pytorch: An imperative style, high-performance deep learning library // Advances in Neural Information Processing Systems. 2019. V. 32. P. 8026-8037.
Wolf T. et al. Transformers: State-of-the-art natural language processing // Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020. P. 38-45.
Liu Y. et al. RoBERTa: A robustly optimized BERT pretraining approach // arXiv preprint arXiv:1907.11692. 2019.
Williams A., Nangia N., Bowman S. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference // Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018. V. 1: Long Papers. P. 1112-1122.