Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 9: 16-26     https://doi.org/10.11925/infotech.2096-3467.2018.1127
Review
Review of Automatic Labeling for Topic Models *
Hongfei Ling, Shiyan Ou
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract

[Objective] This paper summarizes and reviews methods for the automatic semantic labeling of topics produced by topic models, in order to promote the development and application of topic models. [Coverage] We searched the Web of Science and CNKI databases with queries such as "Topic Labeling OR Topic Labelling OR Topic Tagging OR Topic Indexing" and "主题模型 AND (标注 OR 标签)" (topic model AND (labeling OR label)), and manually selected 57 representative papers. [Methods] We read and analyzed the papers in depth, then classified and compared the existing methods according to the source from which topic labels are generated. [Results] Automatic topic labeling consists of two main steps: generating candidate labels and ranking them. By the source of the candidate labels, the methods fall into two categories: those that rely on the corpus itself and those that rely on an external corpus. [Limitations] Research in this area is still limited, and our analysis and review are not fully systematic or comprehensive. [Conclusions] The field still has considerable room for exploration. Labeling topics extracted from social media content is a direction for future research, and existing methods could be improved by incorporating richer knowledge bases and deep learning techniques.

Keywords: Topic Labeling; Probabilistic Topic Models; Latent Dirichlet Allocation (LDA)
Received: 2018-10-16      Published: 2019-10-23
CLC number: G350 TP39
Funding: * This work is supported by the Key Program of the National Social Science Fund of China, "Semantic Publishing of Scholarly Literature Content Based on Linked Data and Its Applications" (Grant No. 17ATQ001).
Cite this article:
Hongfei Ling, Shiyan Ou. Review of Automatic Labeling for Topic Models[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 16-26.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1127      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I9/16
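| Word | views | view | materialized | maintenance | warehouse | tables | summary | updates |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Probability | 0.01 | 0.01 | 0.05 | 0.05 | 0.03 | 0.02 | 0.02 | 0.02 |

  Topic-word probability distribution for the topic "materialized view"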
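
Read as data, the distribution above is simply a mapping from words to probabilities, and the crudest possible label is its few most probable words. A minimal sketch in Python, using the values exactly as listed in the table; the function name `top_words` and the cut-off `k` are illustrative assumptions, not something the review defines:

```python
# Topic-word probabilities copied from the table above. Only the leading
# words of the topic are shown, so the values are not expected to sum to 1.
topic = {
    "views": 0.01, "view": 0.01, "materialized": 0.05, "maintenance": 0.05,
    "warehouse": 0.03, "tables": 0.02, "summary": 0.02, "updates": 0.02,
}


def top_words(topic_word, k=3):
    """Return the k most probable words; ties keep their table order."""
    # sorted() is stable, so words with equal probability stay in input order.
    ranked = sorted(topic_word.items(), key=lambda item: -item[1])
    return [word for word, _ in ranked[:k]]


print(top_words(topic))
```

A word list like this is cheap to produce, but it leaves the reader to infer what the words have in common, which is the gap that phrase, sentence, and image labels try to close.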
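| Topic label type | Advantages | Disadvantages |
| --- | --- | --- |
| Word set | Simple and easy to implement | For domain-specific topics, hard for users to understand |
| Single word | Simple and easy to implement | Too general; cannot cover all the semantic information expressed by the topic-word distribution |
| One or more sentences | Easy to understand; topics are clearly distinguishable | The opposite of the single-word representation: the meaning expressed is too specific |
| Phrase | Sits between the single-word and sentence representations; currently the most widely used way to express topic semantics; easy to understand | May suffer from polysemy and synonymy |
| Image | Intuitive; users understand it quickly; language-independent | Hard to implement; some abstract concepts are difficult to represent with images |

  Types of topic labels and their advantages and disadvantages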
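| Method | Label type | Candidate label generation | Candidate label ranking |
| --- | --- | --- | --- |
| Mei et al. [4] | Phrase | Extract noun phrases from the original corpus using shallow parsing and an n-gram model | Compute the mutual information between the topic-word distribution and each candidate phrase label |
| Kou et al. [25] | Phrase | Extract noun phrases from the original corpus using shallow parsing and chunking | Represent the topic-document distribution and each candidate phrase label with word vectors, then compute the cosine similarity between them |
| Cui et al. [26] | Phrase | Extract noun phrases from the summaries in the original corpus using dependency parsing | Compute the relative entropy (KL divergence) between the topic-document distribution and each candidate "phrase label-document" distribution |
| Nolasco et al. [27] | Phrase | Extract phrases from the original documents associated with a given topic | Rank all candidate phrase labels of the topic using term frequency and its variants |
| Basave et al. [8] | Sentence | Extract summary sentences of a given length from the original documents associated with a given topic using automatic summarization | Determined by the summarization algorithm |
| Wan et al. [29] | Sentence | Compute the relative entropy between the topic-word distribution and every sentence in the original corpus to build the topic's set of candidate sentences | Score candidate sentences jointly on semantic relevance, coverage, and discrimination |
| Mao et al. [30] | Phrase | Extract noun phrases from the original corpus using an n-gram model | Using the hierarchical structure among topics, compute the similarity between the topic-word distribution and each candidate phrase label with two methods: word weighting and JS divergence |

  Summary of automatic topic labeling methods based on the corpus itself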
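
Several rankers in the table compare the topic-word distribution with a distribution attached to each candidate label, using relative entropy (KL divergence) or JS divergence. The sketch below illustrates only that comparison step. The toy topic, the candidate labels and their word distributions, the smoothing constant, and all function names are hypothetical, and the code does not reproduce any specific cited method.

```python
import math


def smooth(dist, vocab, eps=1e-9):
    """Spread a tiny mass over the whole vocabulary, then renormalize."""
    raw = {w: dist.get(w, 0.0) + eps for w in vocab}
    total = sum(raw.values())
    return {w: value / total for w, value in raw.items()}


def kl(p, q):
    """Relative entropy KL(p || q); p and q must share the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rank_labels(topic, candidates):
    """Order candidate labels from closest to farthest from the topic."""
    scored = []
    for label, dist in candidates.items():
        vocab = sorted(set(topic) | set(dist))
        scored.append((js(smooth(topic, vocab), smooth(dist, vocab)), label))
    return [label for _, label in sorted(scored)]


# Hypothetical toy data, not taken from any of the cited papers.
topic = {"views": 0.4, "materialized": 0.3, "maintenance": 0.2, "warehouse": 0.1}
candidates = {
    "data warehouse": {"warehouse": 0.6, "tables": 0.4},
    "materialized view": {"materialized": 0.5, "views": 0.5},
}
print(rank_labels(topic, candidates))
```

JS divergence is used here rather than raw KL because it is symmetric and stays finite even when the two distributions barely overlap; a KL-based ranker would follow the same structure with `kl` in place of `js`.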
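| External corpus type | Label type | Candidate label generation | Candidate label ranking |
| --- | --- | --- | --- |
| Category directories | Single word or phrase [31,32] | Use the topic concepts already present in the directory as candidate labels [31,32] | Compute cosine similarity, the Jaccard coefficient, etc. between the topic-word distribution and the candidate labels [31,32] |
| Search engines, Wikipedia | Single word or phrase [20,33-34,36-37]; image [10,11] | Parse search engine results to generate candidate labels [10-11,20,33-34]; or use Wikipedia entries as candidate labels [20,36-37] | Compute the similarity between the topic-word distribution and the candidate labels [20,36]; rank with machine learning methods [11,20]; select with node centrality measures from network analysis [10,33-34]; use a supervised topic model to extract and label topics at the same time [37] |
| Concept knowledge bases | Single word or phrase [5,21,39-48] | Use the concepts in the knowledge base and their attributes as candidate labels [5,21,39-48] | Select with node centrality measures from network analysis [21,39]; or extend the topic model [5,40-43]; or compute the similarity between knowledge base concepts and the topic-word distribution [44-48] |
| Other external corpora | Single word or phrase [49,50] | Use the topic labels that come with a topically related external corpus as candidate labels [49,50] | Compute the similarity between the topic-word distribution extracted from the external corpus and that of the target problem, then transfer the topic labels [49,50] |

  Summary of automatic topic labeling methods based on an external corpus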
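
For directory-style external sources, the table lists cosine similarity and the Jaccard coefficient as ways to match a topic against candidate labels. A minimal sketch, assuming each candidate concept comes with a bag of descriptive words; the example concepts, their word weights, and the function names are invented for illustration and do not come from the cited systems.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse word-weight vectors (dicts)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Jaccard coefficient between two word sets (weights are ignored)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical topic and directory concepts.
topic = {"materialized": 0.05, "maintenance": 0.05, "warehouse": 0.03, "tables": 0.02}
concepts = {
    "Databases": {"warehouse": 1.0, "tables": 1.0, "query": 1.0},
    "Gardening": {"soil": 1.0, "plants": 1.0},
}

# Pick the concept whose word vector is closest to the topic.
best = max(concepts, key=lambda name: cosine(topic, concepts[name]))
print(best, jaccard(topic, concepts[best]))
```

Cosine similarity uses the word weights, while the Jaccard coefficient only looks at which words are shared, so the two can rank candidates differently when a topic's probability mass is concentrated on a few words.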
[1] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003,3:993-1022.
[2] Xu Ge, Wang Houfeng. The Development of Topic Models in Natural Language Processing[J]. Chinese Journal of Computers, 2011,34(8):1423-1436. (in Chinese)
[3] Chang J, Gerrish S, Wang C, et al. Reading Tea Leaves: How Humans Interpret Topic Models[C]//Proceedings of the 2009 International Conference on Neural Information Processing Systems. 2009: 288-296.
[4] Mei Q, Shen X, Zhai C X. Automatic Labeling of Multinomial Topic Models[C]//Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007: 490-499.
[5] Allahyari M, Kochut K. Automatic Topic Labeling Using Ontology-Based Topic Models[C]//Proceedings of the 14th International Conference on Machine Learning and Applications, Miami, Florida, USA. IEEE, 2015: 259-264.
[6] Gourru A, Velcin J, Roche M, et al. United We Stand: Using Multiple Strategies for Topic Labeling[C]//Proceedings of the 23rd International Conference on Applications of Natural Language to Information Systems, Paris, France. Springer, 2018: 352-363.
[7] Lau J H, Newman D, Karimi S, et al. Best Topic Word Selection for Topic Labelling[C]//Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Beijing, China. 2010: 605-613.
[8] Basave A E C, He Y, Xu R. Automatic Labelling of Topic Models Learned from Twitter by Summarisation[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 618-624.
[9] Wan X, Wang T. Automatic Labeling of Topic Models Using Text Summaries[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 2297-2305.
[10] Aletras N, Stevenson M. Representing Topics Using Images[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 158-167.
[11] Aletras N, Mittal A. Labeling Topics with Images Using a Neural Network[C]//Proceedings of the 39th European Conference on Information Retrieval. Springer, 2017: 500-505.
[12] Aletras N, Baldwin T, Lau J H, et al. Representing Topics Labels for Exploring Digital Libraries[C]//Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries. IEEE, 2014: 239-248.
[13] Aletras N, Baldwin T, Lau J H, et al. Evaluating Topic Representations for Exploring Document Collections[J]. Journal of the Association for Information Science and Technology, 2017,68(1):154-167.
[14] Sorodoc I, Lau J H, Aletras N, et al. Multimodal Topic Labelling[C]//Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017: 701-706.
[15] Popescul A, Ungar L H. Automatic Labeling of Document Clusters[OL]. [2019-01-10]. https://www.cis.upenn.edu/~ungar/Datamining/Publications/labels.pdf.
[16] Manning C D, Raghavan P, Schütze H. Introduction to Information Retrieval[M]. Cambridge: Cambridge University Press, 2008: 396-398.
[17] Role F, Nadif M. Beyond Cluster Labeling: Semantic Interpretation of Clusters’ Contents Using a Graph Representation[J]. Knowledge-Based Systems, 2014,56:141-155.
[18] Carmel D, Roitman H, Zwerdling N. Enhancing Cluster Labeling Using Wikipedia[C]//Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009: 139-146.
[19] Tseng Y H. Generic Title Labeling for Clustered Documents[J]. Expert Systems with Applications, 2010,37(3):2247-2254. doi: 10.1016/j.eswa.2009.07.048
[20] Lau J H, Grieser K, Newman D, et al. Automatic Labelling of Topic Models[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011: 1536-1545.
[21] Hulpus I, Hayes C, Karnstedt M, et al. Unsupervised Graph-Based Topic Labelling Using DBpedia[C]//Proceedings of the 6th ACM International Conference on Web Search and Data Mining. ACM, 2013: 465-474.
[22] Hulpus I, Hayes C, Karnstedt M, et al. An Eigenvalue-Based Measure for Word-Sense Disambiguation[C]//Proceedings of the 25th International Florida Artificial Intelligence Research Society Conference, Marco Island, Florida, USA. 2012.
[23] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]//Proceedings of the 2013 International Conference on Neural Information Processing Systems. 2013: 3111-3119.
[24] Huang P S, He X, Gao J, et al. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data[C]//Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013: 2333-2338.
[25] Kou W, Li F, Baldwin T. Automatic Labelling of Topic Models Using Word Vectors and Letter Trigram Vectors[C]//Proceedings of the 11th Asia Information Retrieval Societies Conference. Springer, 2015: 253-264.
[26] Cui L, Zhang X, Kimpton A, et al. Automatic Labelling of Topics via Analysis of User Summaries[C]//Proceedings of the 27th Australasian Database Conference. Springer, 2016: 295-307.
[27] Nolasco D, Oliveira J. Detecting Knowledge Innovation Through Automatic Topic Labeling on Scholar Data[C]//Proceedings of the 49th Hawaii International Conference on System Sciences. IEEE, 2016: 358-367.
[28] Atapattu T, Falkner K. A Framework for Topic Generation and Labeling from MOOC Discussions[C]//Proceedings of the 3rd ACM Conference on Learning @Scale. ACM, 2016: 201-204.
[29] Wan X, Wang T. Automatic Labeling of Topic Models Using Text Summaries[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016: 2297-2305.
[30] Mao X L, Ming Z Y, Zha Z J, et al. Automatic Labeling Hierarchical Topics[C]//Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012: 2383-2386.
[31] Magatti D, Calegari S, Ciucci D, et al. Automatic Labeling of Topics[C]//Proceedings of the 9th International Conference on Intelligent Systems Design and Applications. IEEE, 2009: 1227-1232.
[32] Magatti D, Stella F. Probabilistic Topic Discovery and Automatic Document Tagging[A]//Brena R F, Guzman-Arenas A. Quantitative Semantics and Soft Computing Methods for the Web: Perspectives and Applications[M]. IGI Global, 2012: 25-49.
[33] Aletras N, Stevenson M. Labelling Topics Using Unsupervised Graph-based Methods[C]//Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 631-636.
[34] Mirzagitova A, Mitrofanova O. Automatic Assignment of Labels in Topic Modelling for Russian Corpora[C]//Proceedings of the 7th Tutorial and Research Workshop on Experimental Linguistics. 2016: 115-118.
[35] Le Q, Mikolov T. Distributed Representations of Sentences and Documents[C]//Proceedings of the 31st International Conference on Machine Learning, Beijing, China. 2014: 1188-1196.
[36] Bhatia S, Lau J H, Baldwin T. Automatic Labelling of Topics with Neural Embeddings[OL]. arXiv Preprint, arXiv:1612.05340.
[37] Lauscher A, Nanni F, Ruiz Fabo P, et al. Entities as Topic Labels: Combining Entity Linking and Labeled LDA to Improve Topic Interpretability and Evaluability[J]. Italian Journal of Computational Linguistics, 2016,2(2):67-88.
[38] Ramage D, Hall D, Nallapati R, et al. Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009: 248-256.
[39] Aker A, Kurtic E, Balamurali A R, et al. A Graph-Based Approach to Topic Clustering for Online Comments to News[C]//Proceedings of the 38th European Conference on Information Retrieval. Springer, 2016: 15-29.
[40] Allahyari M, Pouriyeh S, Kochut K, et al. A Knowledge-Based Topic Modeling Approach for Automatic Topic Labeling[J]. International Journal of Advanced Computer Science & Applications, 2017,8(9):335-349.
[41] Allahyari M, Kochut K. Using Semantically-Extended LDA Topic Model for Semantic Tagging[J]. International Journal of Semantic Computing, 2016,10(4):503-525.
[42] Allahyari M, Kochut K. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network[C]//Proceedings of the 10th International Conference on Semantic Computing, Laguna Hills, California, USA. IEEE, 2016: 63-70.
[43] Allahyari M, Kochut K. OntoLDA: An Ontology-Based Topic Model for Automatic Topic Labeling[OL]. [2018-11-18]. https://datasciencehub.net/system/files/ds-paper-492.pdf.
[44] Adhitama R, Kusumaningrum R, Gernowo R. Topic Labeling Towards News Document Collection Based on Latent Dirichlet Allocation and Ontology[C]//Proceedings of the 1st International Conference on Informatics and Computational Sciences. IEEE, 2017: 247-252.
[45] Davoudi H, An A. Ontology-Based Topic Labeling and Quality Prediction[C]//Proceedings of the 21st International Symposium on Methodologies for Intelligent Systems. Springer, 2015: 171-179.
[46] Hindle A, Ernst N A, Godfrey M W, et al. Automated Topic Naming to Support Analysis of Software Maintenance Activities[C]//Proceedings of the 33rd International Conference on Software Engineering. 2011.
[47] Hindle A, Ernst N A, Godfrey M W, et al. Automated Topic Naming[J]. Empirical Software Engineering, 2013,18(6):1125-1155. doi: 10.1007/s10664-012-9209-9
[48] Mehdad Y, Carenini G, Ng R T, et al. Towards Topic Labeling with Phrase Entailment and Aggregation[C]//Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2013: 179-189.
[49] Herzog A, John P, Mikhaylov S J. Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014[OL]. arXiv Preprint, arXiv:1806.00793.
[50] Mao X L, Hao Y J, Zhou Q, et al. A Novel Fast Framework for Topic Labeling Based on Similarity-Preserved Hashing[C]//Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers. 2016: 3339-3348.
[51] Chi J, Ouyang J, Li C, et al. Topic Representation: Finding More Representative Words in Topic Models[OL]. arXiv Preprint, arXiv:1810.10307.
[52] Alkhodair S A, Fung B C M, Rahman O, et al. Improving Interpretations of Topic Modeling in Microblogs[J]. Journal of the Association for Information Science and Technology, 2018,69(4):528-540.
[53] Chang S, Dai P, Chen J, et al. Got Many Labels?: Deriving Topic Labels from Multiple Sources for Social Media Posts Using Crowdsourcing and Ensemble Learning[C]//Proceedings of the 24th International Conference on World Wide Web. ACM, 2015: 397-406.
[54] Tang W, Wu X, Li Y, et al. A Topic Label Extraction Method for the University BBS[C]//Proceedings of the 1st International Conference on Data Science in Cyberspace, Changsha, China. IEEE, 2016: 678-682.
[55] Zhou Yipeng, Du Junping. Semantic Tagging of a Topic Model Based on Associated Words[J]. CAAI Transactions on Intelligent Systems, 2012,7(4):327-332. (in Chinese)
[56] Arora S, Liang Y, Ma T. A Simple but Tough-to-Beat Baseline for Sentence Embeddings[C]//Proceedings of the 5th International Conference on Learning Representations. 2017.
[57] Yang Z, Zhu C, Chen W. Zero-Training Sentence Embedding via Orthogonal Basis[OL]. arXiv Preprint, arXiv:1810.00438.