Assessment of the applied quality of topic models for clustering problems | Vestnik Tomskogo gosudarstvennogo universiteta. Upravlenie, vychislitelnaja tehnika i informatika – Tomsk State University Journal of Control and Computer Science. 2021. № 56. DOI: 10.17223/19988605/56/11

Methods are investigated for assessing the quality of topic models so that the models can be used reliably in practical problems involving the analysis of a collection of text documents. Using soft clustering as an example, it is shown that the average topic coherence metric alone is not sufficient to assess the applicability of a constructed model, and that it is advisable to also take into account indicators of how documents are linked to highly coherent topics.
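The point about average coherence can be illustrated with a small sketch. It is not the authors' method: the function names, the toy document-topic matrix `theta`, the coherence values, and the threshold are all assumptions chosen for illustration, and the "share of documents" indicator is only one hypothetical way to measure how documents are linked to highly coherent topics. The sketch compares two hypothetical models that have the same average coherence but differ in how many documents are dominated by a coherent topic.

```python
from statistics import mean


def average_coherence(coherence):
    """Mean of the per-topic coherence scores."""
    return mean(coherence)


def share_of_docs_in_coherent_topics(theta, coherence, threshold):
    """Fraction of documents whose dominant topic has coherence >= threshold.

    theta is a list of rows, one per document; each row holds the soft
    cluster (topic) membership weights of that document.
    """
    covered = 0
    for row in theta:
        # Index of the topic with the largest weight for this document.
        dominant = max(range(len(row)), key=row.__getitem__)
        if coherence[dominant] >= threshold:
            covered += 1
    return covered / len(theta)


# Toy soft clustering of four documents over two topics (assumed data).
theta = [
    [0.9, 0.1],
    [0.7, 0.3],
    [0.6, 0.4],
    [0.2, 0.8],
]

# Two hypothetical models with identical average coherence.
coherence_a = [0.8, 0.2]  # the coherent topic dominates three documents
coherence_b = [0.2, 0.8]  # the coherent topic dominates one document

THRESHOLD = 0.5  # assumed cut-off for "highly coherent"

avg_a = average_coherence(coherence_a)
avg_b = average_coherence(coherence_b)
share_a = share_of_docs_in_coherent_topics(theta, coherence_a, THRESHOLD)
share_b = share_of_docs_in_coherent_topics(theta, coherence_b, THRESHOLD)

print(avg_a, avg_b)      # identical averages
print(share_a, share_b)  # different document coverage
```

In this toy setting the average coherence cannot distinguish the two models, while the document-level indicator can, which is the kind of gap the abstract describes.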

Keywords

topic modeling, topic coherence, soft clustering, text analysis, ARTM

Authors

Name	Organization	E-mail
Krasnov Fedor V.	NAUMEN R&D	fkrasnov@naumen.ru
Baskakova Elena N.	NAUMEN R&D	enbaskakova@naumen.ru
Smaznevich Irina S.	NAUMEN R&D	ismaznevich@naumen.ru

