Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (9): 16-26    DOI: 10.11925/infotech.2096-3467.2018.1127
Review of Automatic Labeling for Topic Models
Hongfei Ling, Shiyan Ou
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper reviews methods for automatic topic labeling, aiming to promote the development of topic modeling. [Coverage] We searched the Web of Science and CNKI databases with the query “Topic Labeling OR Topic Labeling OR Topic Tagging OR Topic Indexing” and retrieved a total of 57 representative papers on topic labeling. [Methods] We categorized the existing methods and compared them. [Results] Automatic topic labeling usually involves two steps: generating candidate labels from a corpus and then ranking them. The methods fall into two categories, according to whether the candidate labels are generated from an internal or an external corpus. [Limitations] This review may not cover every study in the field. [Conclusions] More research is needed on automatic topic labeling, for example on labeling topics in user-generated content from social media using deep learning techniques.

Keywords: Topic Labeling; Probabilistic Topic Models; Latent Dirichlet Allocation (LDA)
Received: 16 October 2018      Published: 23 October 2019
CLC Number: G350 TP39

Cite this article:

Hongfei Ling, Shiyan Ou. Review of Automatic Labeling for Topic Models. Data Analysis and Knowledge Discovery, 2019, 3(9): 16-26.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1127     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I9/16

Word | views | view | materialized | maintenance | warehouse | tables | summary | updates
Probability | 0.01 | 0.01 | 0.05 | 0.05 | 0.03 | 0.02 | 0.02 | 0.02
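
Topic label type | Advantages | Disadvantages
Word set | Simple and easy to implement | For domain-specific topics, hard for users to understand
Single word | Simple and easy to implement | The meaning is too general and cannot cover all the semantic information expressed by the "topic-word" distribution
One or more sentences | Easy to understand; topics are clearly distinguished from one another | In contrast to single-word labels, the meaning is too specific
Phrase | Falls between single words and sentences; currently the most widely used way of representing topic semantics; easy to understand | May suffer from polysemy (one word, several meanings) and synonymy (one meaning, several words)
Image | Intuitive; users can understand it quickly; independent of language | Hard to implement; some abstract concepts are difficult to represent with images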
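
Method | Label type | Candidate label generation | Candidate label ranking
Mei et al. [4] | Phrase | Extract noun phrases from the original corpus using shallow parsing and n-gram models | Compute the mutual information between the "topic-word" distribution and each candidate phrase label
Kou et al. [25] | Phrase | Extract noun phrases from the original corpus using shallow parsing and chunking | Represent the "topic-document" distribution and the candidate phrase labels with word vectors, then compute the cosine similarity between the two
Cui et al. [26] | Phrase | Extract noun phrases from the summaries of the original corpus using dependency parsing | Compute the relative entropy (KL divergence) between the "topic-document" distribution and the candidate "phrase label-document" distribution
Nolasco et al. [27] | Phrase | Extract phrases from the original documents associated with a given topic | Rank all candidate phrase labels of the topic using term frequency and its variants
Basave et al. [8] | Sentence | Extract summary sentences of a specified length from the original documents associated with a given topic using automatic summarization | Determined by the summarization algorithm
Wan et al. [29] | Sentence | Generate the candidate sentence set for a topic by computing the relative entropy between the "topic-word" distribution and each sentence in the original corpus | Score candidate sentences jointly on semantic relevance, coverage, and discrimination
Mao et al. [30] | Phrase | Extract noun phrases from the original corpus using n-gram models | Based on the hierarchical structure among topics, compute the similarity between the "topic-word" distribution and the candidate phrase labels using two methods: word weighting and JS divergence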
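
Several ranking methods in the table above compare a "topic-word" distribution with a word distribution associated with each candidate label, using relative entropy (KL divergence) or JS divergence. The following is a minimal sketch of that idea, not the implementation of any cited paper. The topic weights mirror the truncated word probabilities listed earlier (renormalized so they sum to 1), while the candidate labels, their word counts, and all function names are hypothetical and chosen only for illustration.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {word: value / total for word, value in weights.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over the support of p; eps guards against q(w) = 0."""
    return sum(
        pv * math.log(pv / max(q.get(word, 0.0), eps))
        for word, pv in p.items()
        if pv > 0.0
    )


def js_divergence(p, q):
    """Symmetric JS divergence, bounded by ln(2) with natural logarithms."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_labels(topic, candidates):
    """Return (label, divergence) pairs, most similar label first."""
    scored = [
        (label, js_divergence(topic, normalize(counts)))
        for label, counts in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Truncated topic-word probabilities, renormalized (assumption: the
# remaining vocabulary mass is ignored for this illustration).
topic = normalize({
    "views": 0.01, "view": 0.01, "materialized": 0.05, "maintenance": 0.05,
    "warehouse": 0.03, "tables": 0.02, "summary": 0.02, "updates": 0.02,
})

# Hypothetical candidate labels with made-up word counts from their contexts.
candidates = {
    "materialized view maintenance": {
        "materialized": 3, "view": 4, "maintenance": 3, "views": 2,
    },
    "data warehouse": {"warehouse": 5, "data": 5, "tables": 1},
}

for label, score in rank_labels(topic, candidates):
    print(f"{label}: JS divergence = {score:.3f}")
```

JS divergence is used for the ranking here because it is symmetric and stays finite when the two distributions do not share the same vocabulary; plain KL divergence needs smoothing in that case, which the `eps` parameter only approximates.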
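
External corpus type | Label type | Candidate label generation | Candidate label ranking
Taxonomy (classification directory) | Single word or phrase [31,32] | Use the topic concepts already present in the taxonomy as candidate labels [31,32] | Compute the cosine similarity, Jaccard coefficient, etc. between the "topic-word" distribution and the candidate labels [31,32]
Search engines, Wikipedia | Single word or phrase [20,33-34,36-37]; image [10,11] | Parse search engine results to generate candidate labels [10-11,20,33-34]; or use Wikipedia entries as candidate labels [20,36-37] | Compute the similarity between the "topic-word" distribution and the candidate labels [20,36]; rank with machine learning methods [11,20]; select with node centrality measures from network analysis [10,33-34]; perform topic extraction and labeling simultaneously with a supervised topic model [37]
Concept knowledge base | Single word or phrase [5,21,39-48] | Use the concepts in the knowledge base and their attributes as candidate labels [5,21,39-48] | Select with node centrality measures from network analysis [21,39]; or extend the topic model [5,40-43]; or compute the similarity between the concepts in the knowledge base and the "topic-word" distribution [44-48]
Other external corpora | Single word or phrase [49,50] | Use the topic labels that come with a topically related external corpus as candidate labels [49,50] | Compute the similarity between the "topic-word" distributions extracted from the external corpus and those of the target problem, then transfer the topic labels [49,50]
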
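
The taxonomy row above ranks candidate labels by cosine similarity or the Jaccard coefficient. The sketch below shows both measures under simple assumptions: a topic is a dictionary of word weights, and each taxonomy concept is described by a small bag of words. The concept names, their word lists, and the function names are hypothetical; the cited papers may build these representations differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Jaccard coefficient between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_taxonomy_labels(topic, taxonomy):
    """Return (label, cosine, jaccard) triples sorted by cosine, best first."""
    scored = [
        (label, cosine_similarity(topic, words), jaccard(topic, words))
        for label, words in taxonomy.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Topic-word weights (illustrative values only).
topic = {"view": 0.05, "tables": 0.02, "warehouse": 0.03, "updates": 0.02}

# Hypothetical taxonomy concepts, each described by equally weighted words.
taxonomy = {
    "Databases": {"view": 1.0, "tables": 1.0, "warehouse": 1.0, "query": 1.0},
    "Sports": {"match": 1.0, "team": 1.0},
}

for label, cos, jac in rank_taxonomy_labels(topic, taxonomy):
    print(f"{label}: cosine = {cos:.3f}, Jaccard = {jac:.3f}")
```

Cosine similarity takes the word weights into account, whereas the Jaccard coefficient only looks at which words overlap, so the two measures can order candidate labels differently.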
[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[2] Xu Ge, Wang Houfeng. The Development of Topic Models in Natural Language Processing[J]. Chinese Journal of Computers, 2011, 34(8): 1423-1436. (in Chinese)
[3] Chang J, Gerrish S, Wang C, et al. Reading Tea Leaves: How Humans Interpret Topic Models[C]//Proceedings of the 2009 International Conference on Neural Information Processing Systems. 2009: 288-296.
[4] Mei Q, Shen X, Zhai C X. Automatic Labeling of Multinomial Topic Models[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007: 490-499.
[5] Allahyari M, Kochut K. Automatic Topic Labeling Using Ontology-Based Topic Models[C]//Proceedings of the 14th International Conference on Machine Learning and Applications, Miami, Florida, USA. IEEE, 2015: 259-264.
[6] Gourru A, Velcin J, Roche M, et al. United We Stand: Using Multiple Strategies for Topic Labeling[C]//Proceedings of the 23rd International Conference on Applications of Natural Language to Information Systems, Paris, France. Springer, 2018: 352-363.
[7] Lau J H, Newman D, Karimi S, et al. Best Topic Word Selection for Topic Labelling[C]//Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Beijing, China. 2010: 605-613.
[8] Basave A E C, He Y, Xu R. Automatic Labelling of Topic Models Learned from Twitter by Summarisation[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 618-624.
[9] Wan X, Wang T. Automatic Labeling of Topic Models Using Text Summaries[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 2297-2305.
[10] Aletras N, Stevenson M. Representing Topics Using Images[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 158-167.
[11] Aletras N, Mittal A. Labeling Topics with Images Using a Neural Network[C]//Proceedings of the 39th European Conference on Information Retrieval. Springer, 2017: 500-505.
[12] Aletras N, Baldwin T, Lau J H, et al. Representing Topics Labels for Exploring Digital Libraries[C]//Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. IEEE, 2014: 239-248.
[13] Aletras N, Baldwin T, Lau J H, et al. Evaluating Topic Representations for Exploring Document Collections[J]. Journal of the Association for Information Science and Technology, 2017, 68(1): 154-167.
[14] Sorodoc I, Lau J H, Aletras N, et al. Multimodal Topic Labelling[C]//Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017: 701-706.
[15] Popescul A, Ungar L H. Automatic Labeling of Document Clusters[OL]. [2019-01-10]. https://www.cis.upenn.edu/~ungar/Datamining/Publications/labels.pdf.
[16] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008: 396-398.
[17] Role F, Nadif M. Beyond Cluster Labeling: Semantic Interpretation of Clusters’ Contents Using a Graph Representation[J]. Knowledge-Based Systems, 2014, 56: 141-155.
[18] Carmel D, Roitman H, Zwerdling N. Enhancing Cluster Labeling Using Wikipedia[C]//Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009: 139-146.
[19] Tseng Y H. Generic Title Labeling for Clustered Documents[J]. Expert Systems with Applications, 2010, 37(3): 2247-2254. doi: 10.1016/j.eswa.2009.07.048
[20] Lau J H, Grieser K, Newman D, et al. Automatic Labelling of Topic Models[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011: 1536-1545.
[21] Hulpus I, Hayes C, Karnstedt M, et al. Unsupervised Graph-Based Topic Labelling Using DBpedia[C]//Proceedings of the 6th ACM International Conference on Web Search and Data Mining. ACM, 2013: 465-474.
[22] Hulpus I, Hayes C, Karnstedt M, et al. An Eigenvalue-Based Measure for Word-Sense Disambiguation[C]//Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference, Marco Island, Florida, USA. 2012.
[23] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]//Proceedings of the 2013 International Conference on Neural Information Processing Systems. 2013: 3111-3119.
[24] Huang P S, He X, Gao J, et al. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data[C]//Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013: 2333-2338.
[25] Kou W, Li F, Baldwin T. Automatic Labelling of Topic Models Using Word Vectors and Letter Trigram Vectors[C]//Proceedings of the 11th Asia Information Retrieval Societies Conference. Springer, 2015: 253-264.
[26] Cui L, Zhang X, Kimpton A, et al. Automatic Labelling of Topics via Analysis of User Summaries[C]//Proceedings of the 27th Australasian Database Conference. Springer, 2016: 295-307.
[27] Nolasco D, Oliveira J. Detecting Knowledge Innovation Through Automatic Topic Labeling on Scholar Data[C]//Proceedings of the 49th Hawaii International Conference on System Sciences. IEEE, 2016: 358-367.
[28] Atapattu T, Falkner K. A Framework for Topic Generation and Labeling from MOOC Discussions[C]//Proceedings of the 3rd ACM Conference on Learning @ Scale. ACM, 2016: 201-204.
[29] Wan X, Wang T. Automatic Labeling of Topic Models Using Text Summaries[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 2297-2305.
[30] Mao X L, Ming Z Y, Zha Z J, et al. Automatic Labeling Hierarchical Topics[C]//Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012: 2383-2386.
[31] Magatti D, Calegari S, Ciucci D, et al. Automatic Labeling of Topics[C]//Proceedings of the 9th International Conference on Intelligent Systems Design and Applications. IEEE, 2009: 1227-1232.
[32] Magatti D, Stella F. Probabilistic Topic Discovery and Automatic Document Tagging[A]//Brena R F, Guzman-Arenas A. Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications[M]. IGI Global, 2012: 25-49.
[33] Aletras N, Stevenson M. Labelling Topics Using Unsupervised Graph-based Methods[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 631-636.
[34] Mirzagitova A, Mitrofanova O. Automatic Assignment of Labels in Topic Modelling for Russian Corpora[C]//Proceedings of the 7th Tutorial and Research Workshop on Experimental Linguistics. 2016: 115-118.
[35] Le Q, Mikolov T. Distributed Representations of Sentences and Documents[C]//Proceedings of the 31st International Conference on Machine Learning, Beijing, China. 2014: 1188-1196.
[36] Bhatia S, Lau J H, Baldwin T. Automatic Labelling of Topics with Neural Embeddings[OL]. arXiv Preprint, arXiv:1612.05340.
[37] Lauscher A, Nanni F, Ruiz Fabo P, et al. Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability[J]. Italian Journal of Computational Linguistics, 2016, 2(2): 67-88.
[38] Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009: 248-256.
[39] Aker A, Kurtic E, Balamurali A R, et al. A Graph-Based Approach to Topic Clustering for Online Comments to News[C]//Proceedings of the 38th European Conference on Information Retrieval. Springer, 2016: 15-29.
[40] Allahyari M, Pouriyeh S, Kochut K, et al. A Knowledge-Based Topic Modeling Approach for Automatic Topic Labeling[J]. International Journal of Advanced Computer Science & Applications, 2017, 8(9): 335-349.
[41] Allahyari M, Kochut K. Using Semantically-Extended LDA Topic Model for Semantic Tagging[J]. International Journal of Semantic Computing, 2016, 10(4): 503-525.
[42] Allahyari M, Kochut K. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network[C]//Proceedings of the 10th International Conference on Semantic Computing, Laguna Hills, California, USA. IEEE, 2016: 63-70.
[43] Allahyari M, Kochut K. OntoLDA: An Ontology-Based Topic Model for Automatic Topic Labeling[OL]. [2018-11-18]. https://datasciencehub.net/system/files/ds-paper-492.pdf.
[44] Adhitama R, Kusumaningrum R, Gernowo R. Topic Labeling Towards News Document Collection Based on Latent Dirichlet Allocation and Ontology[C]//Proceedings of the 1st International Conference on Informatics and Computational Sciences. IEEE, 2017: 247-252.
[45] Davoudi H, An A. Ontology-Based Topic Labeling and Quality Prediction[C]//Proceedings of the 21st International Symposium on Methodologies for Intelligent Systems. Springer, 2015: 171-179.
[46] Hindle A, Ernst N A, Godfrey M W, et al. Automated Topic Naming to Support Analysis of Software Maintenance Activities[C]//Proceedings of the 33rd International Conference on Software Engineering. 2011.
[47] Hindle A, Ernst N A, Godfrey M W, et al. Automated Topic Naming[J]. Empirical Software Engineering, 2013, 18(6): 1125-1155. doi: 10.1007/s10664-012-9209-9
[48] Mehdad Y, Carenini G, Ng R T, et al. Towards Topic Labeling with Phrase Entailment and Aggregation[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 179-189.
[49] Herzog A, John P, Mikhaylov S J. Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014[OL]. arXiv Preprint, arXiv:1806.00793.
[50] Mao X L, Hao Y J, Zhou Q, et al. A Novel Fast Framework for Topic Labeling Based on Similarity-Preserved Hashing[C]//Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. 2016: 3339-3348.
[51] Chi J, Ouyang J, Li C, et al. Topic Representation: Finding More Representative Words in Topic Models[OL]. arXiv Preprint, arXiv:1810.10307.
[52] Alkhodair S A, Fung B C M, Rahman O, et al. Improving Interpretations of Topic Modeling in Microblogs[J]. Journal of the Association for Information Science and Technology, 2018, 69(4): 528-540.
[53] Chang S, Dai P, Chen J, et al. Got Many Labels?: Deriving Topic Labels from Multiple Sources for Social Media Posts Using Crowdsourcing and Ensemble Learning[C]//Proceedings of the 24th International Conference on World Wide Web. ACM, 2015: 397-406.
[54] Tang W, Wu X, Li Y, et al. A Topic Label Extraction Method for the University BBS[C]//Proceedings of the 1st International Conference on Data Science in Cyberspace, Changsha, China. IEEE, 2016: 678-682.
[55] Zhou Yipeng, Du Junping. Semantic Tagging of a Topic Model Based on Associated Words[J]. CAAI Transactions on Intelligent Systems, 2012, 7(4): 327-332. (in Chinese)
[56] Arora S, Liang Y, Ma T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings[C]//Proceedings of the 5th International Conference on Learning Representations. 2017.
[57] Yang Z, Zhu C, Chen W. Zero-Training Sentence Embedding via Orthogonal Basis[OL]. arXiv Preprint, arXiv:1810.00438.