Study on Keyword Extraction with LDA and TextRank Combination

doi:10.11925/infotech.1003-3513.2014.07.06

New Technology of Library and Information Service

2014, Vol. 30, Issue (7): 41-47


Gu Yijun¹, Xia Tian²,³

1. School of Cyber Security, People's Public Security University of China, Beijing 100038, China;
2. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Renmin University of China, Beijing 100872, China;
3. School of Information Resource Management, Renmin University of China, Beijing 100872, China


Abstract

[Objective] To extract keywords by combining the internal structural information of a single document with the topic information shared across documents. [Methods] LDA is used to model topics and to compute the topic influence of each candidate keyword. The TextRank algorithm is then modified so that the importance of candidate words is transferred non-uniformly, according to both topic influence and word adjacency relations. From these weights, a probability transition matrix is built and used in the iterative calculation that ranks and extracts keywords. [Results] LDA and TextRank are combined effectively, and keyword extraction improves significantly when the data set exhibits a strong topic distribution. [Limitations] The combined method requires costly multi-document topic analysis. [Conclusions] A document's keywords depend on both the document itself and the collection of related documents; combining these two sources of information is an effective way to improve keyword extraction.
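
The following Python sketch illustrates the general idea described in the [Methods] part of the abstract: a TextRank-style iteration whose transition probabilities mix word adjacency with a per-word topic influence. It is not the authors' implementation, since the abstract does not give their formulas. The function names, the mixing parameter `alpha`, the window size, and the `topic_influence` dictionary (which stands in for scores that would come from an LDA model) are all assumptions made for illustration.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Count how often two distinct words appear within `window` tokens of each other."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            v = words[j]
            if v != w:
                graph[w][v] += 1.0
                graph[v][w] += 1.0
    return graph


def topic_weighted_textrank(words, topic_influence, alpha=0.5,
                            damping=0.85, window=2, iters=100, tol=1e-8):
    """Rank words with a transition matrix that mixes adjacency and topic influence.

    `topic_influence` maps word -> non-negative score (assumed to come from LDA).
    `alpha` = 1 uses only topic influence, `alpha` = 0 is plain weighted TextRank.
    """
    graph = cooccurrence_graph(words, window)
    nodes = sorted(graph)

    # Build one row of the probability transition matrix per word.
    trans = {}
    for u in nodes:
        nbrs = graph[u]
        adj_total = sum(nbrs.values())
        topic_total = sum(topic_influence.get(v, 0.0) for v in nbrs)
        row = {}
        for v, count in nbrs.items():
            adj_part = count / adj_total
            if topic_total > 0:
                topic_part = topic_influence.get(v, 0.0) / topic_total
            else:
                topic_part = 1.0 / len(nbrs)  # fall back to a uniform share
            row[v] = alpha * topic_part + (1 - alpha) * adj_part
        trans[u] = row  # each row sums to 1

    # Iterate the damped random walk until the scores stop changing.
    score = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / len(nodes) for u in nodes}
        for u in nodes:
            for v, p in trans[u].items():
                new[v] += damping * score[u] * p
        delta = sum(abs(new[u] - score[u]) for u in nodes)
        score = new
        if delta < tol:
            break
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    tokens = "topic model keyword graph topic keyword rank graph model".split()
    influence = {"topic": 0.4, "keyword": 0.3, "model": 0.2}  # hypothetical LDA scores
    for word, value in topic_weighted_textrank(tokens, influence)[:3]:
        print(word, round(value, 4))
```

In this sketch, a word with a high topic influence receives a larger share of the importance passed on by its neighbours, which is one simple way to make the transfer non-uniform. How the topic influence is actually derived from the LDA output, and how it is combined with adjacency, would need to be taken from the full paper.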

Key words: Keyword extraction; LDA; TextRank; Graph model

Received: 07 February 2014 Published: 20 October 2014

TP393


Cite this article:

Gu Yijun, Xia Tian. Study on Keyword Extraction with LDA and TextRank Combination. New Technology of Library and Information Service, 2014, 30(7): 41-47.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.07.06 OR https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I7/41

