Using Word2vec with TextRank to Extract Keywords

doi:10.11925/infotech.1003-3513.2016.06.03

New Technology of Library and Information Service (现代图书情报技术)

2016, Vol. 32, Issue (6): 20-27 https://doi.org/10.11925/infotech.1003-3513.2016.06.03

Research Paper


Ning Jianfei, Liu Jiangzhen

Department of Electronic Information, Luoding Polytechnic, Luoding 527200, China


Abstract

[Objective] This study extracts keywords by combining the internal structural information of a single document with the word-vector relations learned from the whole document collection. [Methods] Word2vec is used to represent every word in the document collection as a vector, and the similarity between words is computed from these vectors. The TextRank algorithm is then modified so that the weight of each candidate keyword is distributed non-uniformly, according to both word similarity and adjacency relations, and a corresponding probability transition matrix is built for the iterative computation over the word graph and for keyword extraction. [Results] Word2vec and TextRank are integrated effectively, and the effect on keyword extraction is clearly noticeable when the vocabulary of the training collection is reasonably distributed. [Limitations] The method requires costly training on the document collection to obtain the word vectors and the word-relation matrix. [Conclusions] Word relations learned from the document collection help correct the word relations within a single document and improve the accuracy of single-document keyword extraction.

Keywords: keyword extraction, Word2vec, TextRank, graph model, word vector
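
The method summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so the sketch assumes that adjacent words are linked by an edge weighted with their non-negative cosine similarity, that rows are normalized into a transition matrix, and that scores are iterated with a PageRank-style damping factor of 0.85. The function names, the window size, and the hand-made three-dimensional `toy_vectors` (standing in for trained Word2vec vectors) are all hypothetical.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def build_transition_matrix(tokens, vectors, window=2):
    """Return (vocab, P) where P[i, j] is the share of word i's score passed to word j."""
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    weights = np.zeros((n, n))
    for pos, word in enumerate(tokens):
        # Words inside the same window of `window` consecutive tokens are adjacent.
        for other in tokens[pos + 1 : pos + window]:
            if other == word:
                continue
            i, j = index[word], index[other]
            # Assumed weighting: edge weight = non-negative word-vector similarity.
            sim = max(cosine(vectors[word], vectors[other]), 0.0)
            weights[i, j] = weights[j, i] = sim
    row_sums = weights.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Words without weighted neighbours spread their score uniformly.
    P = np.where(row_sums > 0, weights / safe_sums, 1.0 / n)
    return vocab, P


def rank_keywords(tokens, vectors, window=2, damping=0.85, iters=100, tol=1e-8):
    """Iterate the damped score update and return (word, score) pairs, best first."""
    vocab, P = build_transition_matrix(tokens, vectors, window)
    n = len(vocab)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        # Each word collects score from its neighbours in proportion to P.
        updated = (1 - damping) / n + damping * (P.T @ scores)
        converged = np.abs(updated - scores).sum() < tol
        scores = updated
        if converged:
            break
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order]


# Toy stand-ins for Word2vec vectors trained on a document collection.
toy_vectors = {
    "keyword": np.array([0.9, 0.1, 0.0]),
    "extraction": np.array([0.8, 0.3, 0.1]),
    "graph": np.array([0.2, 0.9, 0.1]),
    "model": np.array([0.3, 0.8, 0.2]),
    "weather": np.array([0.0, 0.1, 0.9]),
}
toy_tokens = ["keyword", "extraction", "graph", "model", "keyword", "graph", "weather"]

ranking = rank_keywords(toy_tokens, toy_vectors)
for word, score in ranking[:3]:
    print(f"{word}\t{score:.4f}")
```

In plain TextRank every neighbour receives an equal share of a word's score. Here the shares follow the similarity-weighted rows of `P`, which is one way to realize the non-uniform distribution the abstract describes.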

Received: 2016-03-01 Published: 2016-07-18

Cite this article:

Ning Jianfei, Liu Jiangzhen. Using Word2vec with TextRank to Extract Keywords[J]. New Technology of Library and Information Service, 2016, 32(6): 20-27.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.06.03 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I6/20

