Study on Keyword Extraction with LDA and TextRank Combination

doi:10.11925/infotech.1003-3513.2014.07.06

New Technology of Library and Information Service

2014, Vol. 30

Issue (7): 41-47 https://doi.org/10.11925/infotech.1003-3513.2014.07.06

Knowledge Organization and Knowledge Management

Gu Yijun¹, Xia Tian²,³

1. School of Cyber Security, People's Public Security University of China, Beijing 100038, China;
2. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Renmin University of China, Beijing 100872, China;
3. School of Information Resource Management, Renmin University of China, Beijing 100872, China

Abstract

[Objective] To extract keywords by combining the structural information within a single document with the topic information of the document collection as a whole. [Methods] LDA is used to model the topics of the document collection and to compute the topic influence of each candidate keyword. The TextRank algorithm is then modified so that the importance of a candidate keyword is transferred non-uniformly, according to topic influence and word adjacency, and a new probability transition matrix is built for the iterative computation on the word graph and for keyword extraction. [Results] LDA and TextRank are combined effectively, and keyword extraction improves significantly when the data set exhibits a strong topic distribution. [Limitations] The combined method requires costly multi-document topic analysis. [Conclusions] Keywords are related both to the document itself and to the collection the document belongs to; combining the two is an effective way to improve keyword extraction results.

Key words: Keyword extraction; LDA; TextRank; Graph model
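
The non-uniform transfer described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the exact formulas, so the linear mixing weight `alpha`, the function names, and the hand-set `influence` values (placeholders for topic influence that would come from an LDA model fitted on the document collection) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(words, window=2):
    """Undirected co-occurrence graph: words at most `window` positions apart are adjacent."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    return neighbors


def transition_matrix(neighbors, influence, alpha=0.5):
    """P[v][u] is the share of v's score passed to its neighbor u.

    Assumption: a uniform share (plain TextRank) is mixed linearly with a
    share proportional to u's topic influence; `alpha` is a hypothetical weight.
    """
    P = {}
    for v, nbrs in neighbors.items():
        total = sum(influence[u] for u in nbrs)
        P[v] = {}
        for u in nbrs:
            uniform = 1.0 / len(nbrs)
            topical = influence[u] / total if total > 0 else uniform
            P[v][u] = alpha * uniform + (1 - alpha) * topical
    return P


def rank(P, damping=0.85, iterations=50):
    """PageRank-style power iteration over the transition matrix P."""
    nodes = list(P)
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        new = {u: (1 - damping) / len(nodes) for u in nodes}
        for v in nodes:
            for u, p in P[v].items():
                new[u] += damping * score[v] * p
        score = new
    return score


# Toy input; the influence values are placeholders, not LDA output.
words = "topic model keyword graph keyword rank topic graph".split()
influence = {"topic": 0.9, "model": 0.2, "keyword": 0.8, "graph": 0.5, "rank": 0.1}
P = transition_matrix(build_graph(words), influence)
scores = rank(P)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
```

Each row of `P` sums to 1, so the scores remain a probability distribution across iterations. Setting `alpha=1.0` recovers the uniform transfer of plain TextRank on an unweighted graph, which makes it easy to compare the two behaviours on the same document.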

Received: 2014-02-07 Published: 2014-10-20

TP393

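Funding:

This work is one of the research outcomes of the National Social Science Fund of China project "Collection and Analysis of Online Public Opinion in the Web 2.0 Environment" (Grant No. 09CTQ027) and the Beijing Higher Education Young Elite Teacher Project "Research on Microblog Community Mining Based on Link and Topic Analysis" (Grant No. YETP0215).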

Corresponding author: Xia Tian, E-mail: xiatian1119@gmail.com

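Author contributions: Gu Yijun jointly proposed the research idea, jointly designed the research plan, jointly drafted the paper, and was responsible for revising the final version. Xia Tian jointly proposed the research idea, jointly designed the research plan, jointly drafted the paper, and was responsible for data collection and experiments.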

Cite this article:

Gu Yijun, Xia Tian. Study on Keyword Extraction with LDA and TextRank Combination[J]. New Technology of Library and Information Service, 2014, 30(7): 41-47.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.07.06 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I7/41
