Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (2): 28-34    DOI: 10.11925/infotech.2096-3467.2017.02.04
Original Article
Extracting Keywords with Modified TextRank Model
Xia Tian
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China
School of Information Resource Management, Renmin University of China, Beijing 100872, China
Abstract  

[Objective] This study aims to improve single-document keyword extraction by incorporating world knowledge, in the form of word vectors trained on Wikipedia, into the TextRank model. [Methods] First, we trained a word embedding model with Word2Vec on Chinese Wikipedia data. Second, we clustered the nodes of the TextRank word graph to adjust the voting importance of each cluster. Third, we calculated the random-walk transition probabilities with coverage and location as additional factors. Finally, we obtained the node scores by iterating over the transition matrix and selected the Top N words as keywords. [Results] The new TextRank model performed much better than the other methods when Top N was less than or equal to 7. With only three keywords retrieved, the F-measure reached its maximum value, which was 3.374% higher than the best existing result. When Top N was larger than 7, the results were similar to those of the traditional TextRank method. [Limitations] The cluster analysis increases the computation cost. [Conclusions] The new weighted TextRank model extracts keywords effectively.
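The last two steps of the method (a random walk whose transition probabilities mix several factors, followed by iterative scoring and Top N selection) can be illustrated with a minimal sketch. It is not the author's implementation. The formulas for the coverage and location factors, the equal mixing weights in `mix`, the window size, and the damping factor 0.85 are all assumptions made for illustration, and `cluster_weight` stands in for the per-word importance that the paper derives from clustering Word2Vec vectors, which is not reproduced here.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Undirected co-occurrence graph: words within `window` positions are linked."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != w:
                neighbors[w].add(tokens[j])
                neighbors[tokens[j]].add(w)
    return neighbors


def weighted_textrank(tokens, cluster_weight, window=2, d=0.85,
                      mix=(1 / 3, 1 / 3, 1 / 3), iters=100, tol=1e-8):
    """Score words by a random walk with a mixed transition matrix.

    cluster_weight: hypothetical mapping word -> positive importance,
    assumed to come from a separate clustering step (default 1.0).
    """
    neighbors = build_graph(tokens, window)
    words = sorted(neighbors)

    # Location factor (assumed form): earlier first occurrence -> larger weight.
    first = {}
    for i, w in enumerate(tokens):
        first.setdefault(w, i)
    location = {w: 1.0 / (1 + first[w]) for w in words}

    def normalized(scores, targets):
        total = sum(scores[t] for t in targets)
        return {t: scores[t] / total for t in targets}

    a, b, c = mix  # assumed to sum to 1, so each row below sums to 1
    transition = {}
    for u in words:
        nbrs = neighbors[u]
        cov = 1.0 / len(nbrs)  # coverage (assumed form): equal share per neighbor
        loc = normalized(location, nbrs)
        clu = normalized({t: cluster_weight.get(t, 1.0) for t in nbrs}, nbrs)
        transition[u] = {v: a * cov + b * loc[v] + c * clu[v] for v in nbrs}

    # Power iteration with damping, as in standard TextRank/PageRank.
    score = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        new = {w: (1 - d) / len(words) for w in words}
        for u in words:
            for v, p in transition[u].items():
                new[v] += d * score[u] * p
        delta = sum(abs(new[w] - score[w]) for w in words)
        score = new
        if delta < tol:
            break
    return score


def top_n(scores, n):
    """Return the n highest-scoring words, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Toy input, made up for illustration only.
tokens = ("keyword extraction ranks words in a word graph and "
          "the word graph links words that occur near each other").split()
scores = weighted_textrank(tokens, {"keyword": 2.0, "graph": 1.5})
print(top_n(scores, 3))
```

Because every row of the transition matrix sums to 1, the scores remain a probability distribution throughout the iteration, so they can be compared directly across words.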

Keywords: Keyword Extraction; Word Embedding; TextRank; Word2vec
Received: 28 October 2016      Published: 27 March 2017
ZTFLH:  G353  

Cite this article:

Xia Tian. Extracting Keywords with Modified TextRank Model. Data Analysis and Knowledge Discovery, 2017, 1(2): 28-34.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.02.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I2/28

        TopN = 3             TopN = 5             TopN = 7             TopN = 10
Method  P      R      F      P      R      F      P      R      F      P      R      F
M1      0.304  0.259  0.277  0.230  0.326  0.267  0.188  0.372  0.247  0.151  0.424  0.221
M2      0.119  0.191  0.143  0.095  0.240  0.131  0.080  0.263  0.116  0.072  0.295  0.107
M3      0.019  0.016  0.017  0.017  0.024  0.020  0.016  0.032  0.021  0.018  0.051  0.027
M4      0.356  0.306  0.326  0.270  0.383  0.313  0.217  0.428  0.284  0.170  0.479  0.249
M5      0.369  0.316  0.337  0.276  0.391  0.320  0.218  0.430  0.286  0.169  0.477  0.247

P: precision; R: recall; F: F-measure.
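The 3.374% figure quoted in the abstract can be checked against the F values at TopN = 3 in the table. The sketch below assumes that M5 is the modified model and M4 the strongest existing method; this excerpt does not define M1 to M5, so that mapping is an inference from the numbers. The helper `prf` shows the usual per-document precision, recall and F-measure under exact string matching, which is also an assumption about how the evaluation was done.

```python
def prf(extracted, annotated):
    """Precision, recall and F-measure for one document (exact string match)."""
    hits = len(set(extracted) & set(annotated))
    p = hits / len(extracted) if extracted else 0.0
    r = hits / len(annotated) if annotated else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# F values at TopN = 3 taken from the table (M4 and M5 rows).
f_m4, f_m5 = 0.326, 0.337
print(f"{relative_gain(f_m5, f_m4):.3f}%")  # prints 3.374%
```

Under this reading the improvement is relative: the absolute gain in F is 0.011.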
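Keywords extracted for three sample documents (Chinese originals in parentheses)

Document 101037
  Annotated: private enterprise (民企), military industry (军工), integration (融合)
  M1: government (政府), company (公司), Japan (日本)
  M2: enterprise (企业), government (政府), Japan (日本)
  M3: enterprise (企业), government (政府), Japan (日本)
  M4: munitions (军火), enterprise (企业), Japan (日本)
  M5: munitions (军火), enterprise (企业), government (政府)

Document 24576
  Annotated: Japanese invasion of China (日本侵华), steamship (轮船), compensation claim (索赔), Chen Chun (陈春)
  M1: Japan (日本), Chen Shuntong (陈顺通), Chen Qiaqun (陈洽群), lawyer (律师)
  M2: Chen Shuntong (陈顺通), youngest son (幼子), Shanghai (上海), Mitsui (三井)
  M3: rent (租金), witness (见证), shipping industry (航运业), scheduled (预定)
  M4: shipping magnate (船王), Japan (日本), Chen Shuntong (陈顺通), Chen Qiaqun (陈洽群)
  M5: shipping magnate (船王), civilian (民间), Japan (日本), Chen Shuntong (陈顺通)

Document 26808
  Annotated: Ministry of Finance (财政部), finance (金融), senior executives (高管), pay cap (限薪)
  M1: remuneration (薪酬), financial institution (金融机构), state-owned (国有)
  M2: state-owned (国有), financial institution (金融机构), level (水平)
  M3: solicit (征求), related (相关), chief supervisor (监事长)
  M4: financial institution (金融机构), remuneration (薪酬), state-owned (国有)
  M5: financial institution (金融机构), state-owned (国有), remuneration (薪酬)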