Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue 7: 76-86     https://doi.org/10.11925/infotech.2096-3467.2020.0071
Research Paper
Extracting Key-phrases from Chinese Scholarly Papers*
Xia Tian (夏天)
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
Abstract

[Objective] This paper proposes a method for automatically extracting key phrases from Chinese scholarly texts, providing phrase-level concept representation for academic text mining. [Methods] Two indicators are introduced: internal cohesion, which measures how tightly the words within a phrase are bound together, and boundary freedom, which measures how freely the words at the phrase boundary combine with other words. These indicators are used to compute the authority of Chinese bi-word phrases. The scored phrases are then merged and ranked together with the results of position-weighted keyword extraction, and the TopN elements are selected as the final key phrases. [Results] On a purpose-built dataset of Chinese academic papers, the key-phrase extraction algorithm PhraseRank substantially outperforms the traditional keyword extraction algorithm WordRank in precision, recall, and the rank-aware R-MAP metric; the relative improvement in R-MAP exceeds 128%. [Limitations] The method does not identify key phrases consisting of three or more words. [Conclusions] Compared with keywords, the key phrases extracted by PhraseRank are more consistent with manually labeled results and better reflect how concepts are expressed in Chinese scholarly texts.

Key words: Key-phrase Extraction; Academic Text Mining; TextRank; Word Graph
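
The abstract describes the two indicators only qualitatively. As a minimal sketch of how such scores could be computed for adjacent word pairs, the Python snippet below assumes pointwise mutual information for cohesion and the smaller of the left- and right-neighbour entropies for freedom. These formulas, the function name `score_bigrams`, and the toy token sequence are illustrative assumptions, not the definitions used by PhraseRank.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(tokens):
    """Return {(w1, w2): (cohesion, freedom)} for adjacent word pairs.

    Assumed definitions (not taken from the paper):
      cohesion = pointwise mutual information of the pair
      freedom  = min(entropy of left neighbours, entropy of right neighbours)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Collect the words seen immediately before and after each pair.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(n - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < n:
            right[pair][tokens[i + 2]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return sum(-(c / total) * math.log(c / total) for c in counter.values())

    scores = {}
    for pair, count in bigrams.items():
        p_pair = count / (n - 1)
        p_first = unigrams[pair[0]] / n
        p_second = unigrams[pair[1]] / n
        cohesion = math.log(p_pair / (p_first * p_second))
        freedom = min(entropy(left[pair]), entropy(right[pair]))
        scores[pair] = (cohesion, freedom)
    return scores


# Toy, already-segmented text; a real pipeline would need a Chinese word segmenter.
tokens = "提出 关键 短语 抽取 方法 评估 关键 短语 质量 比较 关键 短语 与 关键 词".split()
for pair, (cohesion, freedom) in score_bigrams(tokens).items():
    print(pair, round(cohesion, 3), round(freedom, 3))
```

In this toy sequence the pair (关键, 短语) is preceded and followed by three different words, so its freedom is high, whereas (短语, 抽取) is always preceded by 关键, which suggests that its left boundary is not a real phrase boundary. How the two scores are combined into an authority value and merged with the keyword ranking is left open here.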
Received: 2020-02-01      Published: 2020-07-25
CLC number (ZTFLH):  G353
Funding: *This work is one of the research outcomes of the Major Program of the National Social Science Fund of China, "Research on the Archiving and Management of Government Information Resources in the Big Data Environment" (17ZDA293).
Corresponding author: Xia Tian     E-mail: xiat@ruc.edu.cn
Cite this article:
Xia Tian. Extracting Key-phrases from Chinese Scholarly Papers[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 76-86.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0071      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I7/76
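Fig.1  Workflow of key-phrase extraction
Fig.2  Fragment of the word graph built from the text of a paper abstract
Fig.3  Illustration of bi-word phrase identification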
Words per phrase   Avg. length (characters)   Occurrences   Share    Cumulative share
1                  3.34                       20,303        28.07%   28.07%
2                  4.33                       39,028        53.95%   82.02%
3                  5.95                       10,005        13.83%   95.85%
4                  7.46                       2,142         2.96%    98.81%
5                  9.48                       476           0.66%    99.47%
6                  10.55                      218           0.30%    99.77%
7                  12.65                      79            0.11%    99.88%
8                  15.59                      37            0.05%    99.93%
9                  17.07                      14            0.02%    99.95%
10                 16.18                      22            0.03%    99.98%
Other              -                          13            0.02%    100.00%
Table 1  Statistics of key phrases in the dataset
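
The share and cumulative-share columns of Table 1 can be rechecked from the occurrence counts alone. The short Python sketch below does this; the helper name `share_table` is hypothetical, and the counts are copied from the table.

```python
# Occurrence counts from Table 1, keyed by the number of words per key phrase.
counts = {
    "1": 20303, "2": 39028, "3": 10005, "4": 2142, "5": 476, "6": 218,
    "7": 79, "8": 37, "9": 14, "10": 22, "other": 13,
}


def share_table(counts):
    """Return (label, share %, cumulative share %) rows, rounded to 2 decimals."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, count in counts.items():
        running += count
        rows.append((label, round(100 * count / total, 2), round(100 * running / total, 2)))
    return rows


for label, share, cumulative in share_table(counts):
    print(f"{label:>5}  {share:6.2f}%  {cumulative:6.2f}%")
```

The first two rows sum to roughly 82% of all labeled key phrases, which gives some context for the stated limitation that phrases of three or more words are not identified.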
Fig.4  Comparison of precision, recall, and F-measure (PRF) between PhraseRank and WordRank
Fig.5  MAP@N of the two algorithms as N varies
Algorithm    MAP@3   MAP@5   MAP@7   MAP@10   R-MAP
WordRank     0.070   0.083   0.087   0.091    0.077
PhraseRank   0.164   0.188   0.201   0.211    0.176
Table 2  Comparison of MAP@N and R-MAP for WordRank and PhraseRank
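
The relative improvement reported in the abstract can be reproduced from Table 2 as (PhraseRank - WordRank) / WordRank. The Python sketch below applies this to every column; the variable names are illustrative, and because the table values are rounded to three decimals the percentages are approximate.

```python
# Scores copied from Table 2.
word_rank = {"MAP@3": 0.070, "MAP@5": 0.083, "MAP@7": 0.087, "MAP@10": 0.091, "R-MAP": 0.077}
phrase_rank = {"MAP@3": 0.164, "MAP@5": 0.188, "MAP@7": 0.201, "MAP@10": 0.211, "R-MAP": 0.176}

# Relative gain of PhraseRank over WordRank, in percent.
relative_gain = {
    metric: 100 * (phrase_rank[metric] - word_rank[metric]) / word_rank[metric]
    for metric in word_rank
}

for metric, gain in relative_gain.items():
    print(f"{metric}: +{gain:.1f}%")
```

For R-MAP this gives (0.176 - 0.077) / 0.077, or about 128.6%, consistent with the "more than 128%" stated in the abstract.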
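Document title: 面向安全教育的儿童阅读推广研究 (Research on Children's Reading Promotion for Safety Education)
  Manual labels: 图书馆 (library), 儿童阅读推广 (children's reading promotion), 安全 (safety)
  WordRank: 儿童 (children), 推广 (promotion), 阅读 (reading)
  PhraseRank: 儿童 (children), 阅读推广 (reading promotion), 教育 (education)
Document title: 图书馆电子书馆配研究 (Research on E-book Supply to Libraries)
  Manual labels: 馆配市场 (library supply market), 电子书馆配 (e-book library supply), 图书馆 (library)
  WordRank: 电子书 (e-book), 图书馆 (library), 文献 (literature)
  PhraseRank: 图书馆电子书 (library e-book), 文献 (literature), 市场 (market)
Document title: 国外基于情感角度的信息搜寻行为研究进展 (Research Progress Abroad on Information Seeking Behavior from an Affective Perspective)
  Manual labels: 情感 (emotion), 认知 (cognition), 信息搜寻行为 (information seeking behavior)
  WordRank: 情感 (emotion), 信息 (information), 搜寻 (seeking)
  PhraseRank: 情感因素 (emotional factors), 影响信息 (influence information), 搜寻 (seeking)
Document title: 试析大数据在电子文件管理中的应用 (An Analysis of the Application of Big Data in Electronic Records Management)
  Manual labels: 大数据 (big data), 电子文件管理 (electronic records management)
  WordRank: 文件 (records), 电子 (electronic)
  PhraseRank: 电子文件 (electronic records), 文件管理 (records management)
Document title: 虚实融合的图书馆空间互动服务模式研究 (Research on the Interactive Service Model of Library Space Integrating Virtual and Physical Space)
  Manual labels: 图书馆 (library), 实体空间 (physical space), 虚拟空间 (virtual space)
  WordRank: 图书馆 (library), 空间 (space), 服务 (service)
  PhraseRank: 图书馆空间 (library space), 服务模式 (service model), 互动服务 (interactive service)
Table 3  Examples of extraction results for documents with R-AP = 0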
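
To illustrate how a rank-aware score can drop to zero for such documents, the Python sketch below computes average precision at N under exact string matching. The normalisation by min(N, number of manual labels) and the function name `average_precision_at_n` are assumptions made for illustration; this page does not give the paper's exact definition of R-AP.

```python
def average_precision_at_n(ranked, gold, n):
    """AP@N with exact string matching (assumed definition).

    ranked: extracted phrases in rank order, assumed to contain no duplicates
    gold:   manually labeled key phrases
    """
    gold_set = set(gold)
    if not gold_set or n <= 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in gold_set:
            hits += 1
            precision_sum += hits / k  # precision at rank k, counted only at hits
    return precision_sum / min(n, len(gold_set))


# Second document of Table 3.
gold = ["馆配市场", "电子书馆配", "图书馆"]
phrase_rank = ["图书馆电子书", "文献", "市场"]
word_rank = ["电子书", "图书馆", "文献"]

print(average_precision_at_n(phrase_rank, gold, len(gold)))
print(average_precision_at_n(word_rank, gold, len(gold)))
```

Under this assumed definition, the PhraseRank list scores 0 because none of its phrases matches a manual label exactly, even though 图书馆电子书 overlaps with two of them. The WordRank list scores above 0 because 图书馆 matches at rank 2.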