Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 68-76    DOI: 10.11925/infotech.2096-3467.2018.0765
Extracting Keywords Based on Topic Structure and Word Diagram Iteration
Mingzhu Sun, Jing Ma, Lingfei Qian
School of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Abstract  

[Objective] This paper integrates topic information into the TextRank model, aiming to improve the precision and recall of automatic keyword extraction. [Methods] First, we used LDA (Latent Dirichlet Allocation) to model document topics and obtained the topic distribution of the candidate keywords. Then, we calculated the node weights from the topic-word probability distribution. Next, we combined the document-topic and topic-word probability distributions, with weights, to form each node's random jump probability. Finally, we constructed a new transition matrix for the word graph iteration, which modifies the original TextRank model. [Results] We evaluated the proposed model on 1559 news articles from the website of Southern Weekly. When three keywords were extracted per document, the model's precision was 4.7 and 6.5 percentage points higher than that of the original TextRank and TF-IDF algorithms, respectively. [Limitations] The fusion algorithm increases computational complexity. [Conclusions] The proposed algorithm extracts keywords more effectively than the baseline methods.
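
The general idea described in the abstract, a word-graph iteration whose random jump is biased by topic information, can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the document-topic probabilities (`theta`) and topic-word probabilities (`phi`) have already been obtained from a fitted LDA model, it uses plain co-occurrence counts for the transitions instead of the paper's modified transition matrix, and the function name, parameters, and jump formula (proportional to the sum over topics of `theta[k] * phi[k][w]`) are illustrative assumptions.

```python
def topic_biased_textrank(tokens, theta, phi, window=3, damping=0.85, iters=200):
    """Rank words with a PageRank-style iteration and a topic-biased random jump.

    tokens: candidate words of one document, in order of appearance.
    theta:  list of P(topic | document), one value per topic (assumed given).
    phi:    list of dicts, phi[k][word] = P(word | topic k) (assumed given).
    """
    words = sorted(set(tokens))

    # Undirected co-occurrence graph: words within `window` tokens are linked.
    weight = {}
    for i, w in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != w:
                weight[(w, u)] = weight.get((w, u), 0.0) + 1.0
                weight[(u, w)] = weight.get((u, w), 0.0) + 1.0
    out_sum = {w: sum(weight.get((w, u), 0.0) for u in words) for w in words}

    # Random-jump distribution (assumption): proportional to sum_k theta[k] * phi[k][w].
    raw = {w: sum(t * p.get(w, 0.0) for t, p in zip(theta, phi)) for w in words}
    total = sum(raw.values())
    jump = {w: (raw[w] / total if total > 0 else 1.0 / len(words)) for w in words}

    # Power iteration starting from a uniform score vector.
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new_score = {}
        for w in words:
            inflow = 0.0
            for u in words:
                if out_sum[u] > 0:
                    inflow += score[u] * weight.get((u, w), 0.0) / out_sum[u]
                else:
                    # A word with no edges spreads its score via the jump distribution.
                    inflow += score[u] * jump[w]
            new_score[w] = (1.0 - damping) * jump[w] + damping * inflow
        score = new_score

    # Highest-scoring words first.
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy input: one topic that favours "formaldehyde".
ranked = topic_biased_textrank(
    ["kindergarten", "formaldehyde", "renovation", "formaldehyde", "cough"],
    theta=[1.0],
    phi=[{"formaldehyde": 0.5, "kindergarten": 0.2, "renovation": 0.2, "cough": 0.1}],
)
```

With a uniform jump distribution the iteration reduces to the usual TextRank-style ranking, so the topic bias can be switched off for comparison by passing equal `phi` values for all words.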

Key words: Keywords Extraction; TextRank; Latent Dirichlet Allocation; Graph Model
Received: 15 July 2018      Published: 29 September 2019
ZTFLH:  TP393 G35  
Corresponding Authors: Jing Ma     E-mail: majing5525@126.com

Cite this article:

Mingzhu Sun, Jing Ma, Lingfei Qian. Extracting Keywords Based on Topic Structure and Word Diagram Iteration. Data Analysis and Knowledge Discovery, 2019, 3(8): 68-76.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0765     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I8/68

K P R F1
50 0.248 0.209 0.224
100 0.251 0.211 0.226
150 0.252 0.212 0.228
200 0.250 0.211 0.226
TopN  TF-IDF               TextRank             LDA                  Method of Ref. [13]  Method of Ref. [14]  Proposed algorithm
      P      R      F1     P      R      F1     P      R      F1     P      R      F1     P      R      F1     P      R      F1
3     0.213  0.182  0.196  0.231  0.194  0.211  0.243  0.203  0.221  0.245  0.206  0.224  0.248  0.211  0.228  0.278  0.239  0.257
5     0.163  0.230  0.191  0.175  0.244  0.204  0.191  0.256  0.219  0.203  0.256  0.226  0.213  0.282  0.243  0.216  0.289  0.247
7     0.135  0.264  0.179  0.141  0.274  0.186  0.162  0.289  0.208  0.169  0.293  0.214  0.183  0.325  0.234  0.181  0.323  0.232
9     0.116  0.291  0.166  0.120  0.299  0.171  0.135  0.318  0.190  0.145  0.324  0.200  0.162  0.357  0.223  0.159  0.351  0.219
15    0.083  0.343  0.134  0.083  0.344  0.134  0.102  0.362  0.159  0.106  0.375  0.165  0.124  0.411  0.191  0.119  0.399  0.183
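
The F1 values in the table above are consistent with the harmonic mean of P and R; for the last column of the top-3 row, 2 × 0.278 × 0.239 / (0.278 + 0.239) ≈ 0.257. The Python sketch below shows one common way to compute these metrics for a single document. How the authors averaged scores over the 1559 articles is not stated on this page, so the per-document set-based matching and the function names here are assumptions for illustration only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(extracted, reference):
    """Compare extracted keywords with reference keywords by exact match.

    extracted: keywords produced by an extraction method for one document.
    reference: manually assigned keywords for the same document.
    """
    extracted_set, reference_set = set(extracted), set(reference)
    hits = len(extracted_set & reference_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(reference_set) if reference_set else 0.0
    return precision, recall, f1_score(precision, recall)


# Check against the table: P = 0.278 and R = 0.239 give F1 of about 0.257.
table_f1 = round(f1_score(0.278, 0.239), 3)

# Hypothetical document: 2 of 3 extracted keywords appear among 4 reference keywords.
p, r, f1 = precision_recall_f1(["a", "b", "c"], ["a", "c", "d", "e"])
```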
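Document No.  Extraction method   Keywords (ranked 1 to 5)
1             TextRank            宜昌 (Yichang); 郭有明 (Guo Youming); 部门 (department); 透露 (disclose); 知情人 (insider)
              Proposed algorithm  郭有明 (Guo Youming); 副省长 (vice governor); 报道 (report); 涉嫌 (suspected); 违纪违法 (violations of discipline and law)
2             TextRank            幼儿 (young children); 装修 (renovation); 幼儿园 (kindergarten); 咳嗽 (cough); 皮肤 (skin)
              Proposed algorithm  幼儿园 (kindergarten); 甲醛 (formaldehyde); 装修 (renovation); 咳嗽 (cough); 过敏 (allergy)
3             TextRank            公司 (company); 丁羽心 (Ding Yuxin); 人民币 (RMB); 刘志军 (Liu Zhijun); 有限公司 (limited company)
              Proposed algorithm  丁羽心 (Ding Yuxin); 刘志军 (Liu Zhijun); 并处 (concurrently sentenced); 有限公司 (limited company); 非法经营 (illegal business operation)
4             TextRank            HPV疫苗 (HPV vaccine); 宫颈癌 (cervical cancer); 接种 (vaccination); 默沙东 (MSD); 试验 (trial)
              Proposed algorithm  HPV疫苗 (HPV vaccine); 宫颈癌 (cervical cancer); 临床试验 (clinical trial); 中国 (China); 上市 (market launch)
5             TextRank            报道 (report); 衡阳市 (Hengyang City); 破坏选举 (election sabotage); 衡阳 (Hengyang); 人大代表 (People's Congress deputy)
              Proposed algorithm  破坏选举 (election sabotage); 衡阳 (Hengyang); 人大代表 (People's Congress deputy); 涉嫌 (suspected); 立案 (case filing)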
[1] Zhao Jingsheng, Zhu Qiaoming, Zhou Guodong, et al. Review of Research in Automatic Keyword Extraction[J]. Journal of Software, 2017, 28(9): 2431-2449.
[2] Mihalcea R, Tarau P. TextRank: Bringing Order into Texts[C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[3] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[4] Turney P D. Learning Algorithms for Keyphrase Extraction[J]. Information Retrieval, 2000, 2(4): 303-336.
[5] Frank E, Paynter G W, Witten I H, et al. Domain-Specific Keyphrase Extraction[C]// Proceedings of the 16th International Joint Conference on Artificial Intelligence. 1999: 668-673.
[6] Qian Aibing, Jiang Lan. Chinese Webpage Keyword Extraction Based on Improved TF-IDF: Taking News Webpage as an Example[J]. Information Studies: Theory & Application, 2008, 31(6): 945-950.
[7] Yang Kaiyan. Research on Automatic Keyword Extraction Algorithm Based on Improved TFIDF[D]. Xiangtan: Xiangtan University, 2015.
[8] Zhu Zede, Li Miao, Zhang Jian, et al. A LDA-Based Approach to Keyphrase Extraction[J]. Journal of Central South University: Science and Technology, 2015, 46(6): 2142-2148.
[9] Ding Zhuoye. Research on Keyword Extraction Methods for Topics[D]. Shanghai: Fudan University, 2013.
[10] Xia Tian. Study on Keyword Extraction Using Word Position Weighted TextRank[J]. New Technology of Library and Information Service, 2013(9): 30-34.
[11] Xia Tian. Extracting Keywords with Modified TextRank Model[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 28-34.
[12] Bougouin A, Boudin F, Daille B. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction[C]// Proceedings of the 2013 International Joint Conference on Natural Language Processing. 2013: 543-551.
[13] Gu Yijun, Xia Tian. Study on Keyword Extraction with LDA and TextRank Combination[J]. New Technology of Library and Information Service, 2014(7-8): 41-47.
[14] Liu Xiaojian, Xie Fei, Wu Xindong. Graph Based Keyphrase Extraction Using LDA Topic Model[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(6): 664-672.
[15] Liu Z, Huang W, Zheng Y, et al. Automatic Keyphrase Extraction via Topic Decomposition[C]// Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 2010: 366-376.
[1] Shan Xiaohong, Wang Chunwen, Liu Xiaoyan, Han Shengxi, Yang Juan. Identifying Lead Users in Open Innovation Community from Knowledge-based Perspectives[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 85-96.
[2] Yan Qiang, Zhang Xiaoyan, Zhou Simin. Extracting Keywords Based on Sememe Similarity[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 80-89.
[3] Wang Hongbin, Wang Jianxiong, Zhang Yafei, Yang Heng. Topic Recognition of News Reports with Imbalanced Contents[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 109-120.
[4] Xia Tian. Extracting Key-phrases from Chinese Scholarly Papers[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 76-86.
[5] Shen Zhihong, Zhao Zihao, Wang Haibo. Big Data Technology Stack Shifting: From SQL Centric to Graph Centric[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 50-65.
[6] Hongfei Ling, Shiyan Ou. Review of Automatic Labeling for Topic Models[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 16-26.
[7] Qingtian Zeng, Xiaohui Hu, Chao Li. Extracting Keywords with Topic Embedding and Network Structure Analysis[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 52-60.
[8] Zhen Zhang, Jin Zeng. Extracting Keywords from User Comments: Case Study of Meituan[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 36-44.
[9] An Wang, Yijun Gu, Kunming Li, Wenzheng Li. Extracting Keywords Based on Removed Network Word Nodes[J]. Data Analysis and Knowledge Discovery, 2019, 3(11): 35-44.
[10] Yuman Li, Zhibo Chen, Fu Xu. Classifying Texts with KACC Model[J]. Data Analysis and Knowledge Discovery, 2019, 3(10): 89-97.
[11] Liu Zhuchen, Chen Hao, Yu Yanhua, Li Jie. Extracting Keywords with TextRank and Weighted Word Positions[J]. Data Analysis and Knowledge Discovery, 2018, 2(9): 74-79.
[12] Wang Zixuan, Le Xiaoqiu, He Yuanbiao. Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity[J]. Data Analysis and Knowledge Discovery, 2017, 1(4): 1-8.
[13] Xia Tian. Extracting Keywords with Modified TextRank Model[J]. Data Analysis and Knowledge Discovery, 2017, 1(2): 28-34.
[14] Ning Jianfei, Liu Jiangzhen. Using Word2vec with TextRank to Extract Keywords[J]. New Technology of Library and Information Service, 2016, 32(6): 20-27.
[15] Hong Ma, Yongming Cai. A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature[J]. Data Analysis and Knowledge Discovery, 2016, 32(12): 17-26.