New Technology of Library and Information Service  2016, Vol. 32 Issue (6): 20-27    DOI: 10.11925/infotech.1003-3513.2016.06.03
Original Article
Using Word2vec with TextRank to Extract Keywords
Ning Jianfei, Liu Jiangzhen
Department of Electronic Information, Luoding Polytechnic, Luoding 527200, China
Abstract  

[Objective] This study extracts keywords by combining the internal structure of a single document with word vectors learned from the whole corpus. [Methods] First, we used Word2vec to learn a vector for every word in the document corpus and calculated the similarities between words. Second, we modified the TextRank algorithm and assigned weights to words according to their similarities and adjacency relations. Finally, we built a probability transition matrix for the iterative computation over the lexical graph model and extracted the keywords. [Results] Integrating Word2vec with TextRank extracted keywords effectively. [Limitations] The proposed method requires substantial training on the corpus to build the word vectors and the relation matrix. [Conclusions] Word relationships learned from the document set can be used to adjust word relationships within a single document, which increases the accuracy of keyword extraction from individual documents.
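
The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the abstract does not give the exact weighting formula, so the edge weight used here (1 + cosine similarity per co-occurrence), the window size, the damping factor, and the function name `extract_keywords` are all hypothetical. The toy vectors stand in for embeddings that would in practice come from a Word2vec model trained on the corpus.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def extract_keywords(tokens, vectors, window=2, damping=0.85,
                     top_k=3, iters=100, tol=1e-8):
    """Rank words of one document with a similarity-weighted TextRank.

    tokens  : list of words of a single document, in order
    vectors : dict word -> np.ndarray (assumed to come from Word2vec)
    Words without a vector are ignored.
    """
    vocab = sorted({t for t in tokens if t in vectors})
    n = len(vocab)
    if n == 0:
        return []
    index = {w: i for i, w in enumerate(vocab)}

    # Symmetric edge weights: adjacency within the window, scaled by
    # embedding similarity. The "1 + similarity" form is an assumption.
    weights = np.zeros((n, n))
    for i, w in enumerate(tokens):
        if w not in index:
            continue
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v not in index or v == w:
                continue
            sim = max(cosine(vectors[w], vectors[v]), 0.0)
            weights[index[w], index[v]] += 1.0 + sim
            weights[index[v], index[w]] += 1.0 + sim

    # Column-stochastic transition matrix; isolated words jump uniformly.
    col_sums = weights.sum(axis=0)
    transition = np.full((n, n), 1.0 / n)
    nonzero = col_sums > 0
    transition[:, nonzero] = weights[:, nonzero] / col_sums[nonzero]

    # Power iteration, as in PageRank/TextRank.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_scores = (1.0 - damping) / n + damping * (transition @ scores)
        converged = np.abs(new_scores - scores).sum() < tol
        scores = new_scores
        if converged:
            break

    ranked = sorted(zip(vocab, scores.tolist()), key=lambda p: -p[1])
    return ranked[:top_k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc = ["keyword", "extraction", "graph", "keyword",
           "ranking", "graph", "keyword"]
    # Toy stand-ins for corpus-trained Word2vec vectors.
    toy_vectors = {w: rng.normal(size=8) for w in set(doc)}
    for word, score in extract_keywords(doc, toy_vectors):
        print(f"{word}\t{score:.4f}")
```

In this sketch the corpus-level information enters only through `vectors`, while the adjacency counts come from the single document, which mirrors the combination the abstract describes.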

Key words: Keyword extraction; Word2vec; TextRank; Graphical model; Word vector
Received: 01 March 2016      Published: 18 July 2016

Cite this article:

Ning Jianfei, Liu Jiangzhen. Using Word2vec with TextRank to Extract Keywords. New Technology of Library and Information Service, 2016, 32(6): 20-27.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.06.03     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I6/20
