[Objective] This paper summarizes and reviews automatic semantic labeling methods for topics produced by topic models, aiming to promote the development and application of topic models. [Coverage] We searched the Web of Science and CNKI databases with queries such as “Topic Labeling OR Topic Labelling OR Topic Tagging OR Topic Indexing” and “主题模型 AND (标注 OR 标签)” (topic model AND (labeling OR label)), respectively, and manually screened the results to obtain 57 representative papers. [Methods] We read and analyzed the papers in depth, then classified and compared the existing methods according to the source from which candidate topic labels are generated. [Results] Automatic topic labeling for topic models consists of two main steps: generating candidate labels and ranking them. By the source of the candidate labels, the methods fall into two categories: those relying on the corpus itself and those relying on an external corpus. [Limitations] Research in this field is still limited, and the analysis and review presented here are not fully systematic or comprehensive. [Conclusions] The field still leaves considerable room for exploration. Topic labeling for social media content is a direction for future research, and existing methods could be improved by drawing on richer knowledge bases and deep learning techniques.