Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (7): 76-86    DOI: 10.11925/infotech.2096-3467.2020.0071
Extracting Key-phrases from Chinese Scholarly Papers
Xia Tian
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University of China, Beijing 100872, China; School of Information Resource Management, Renmin University of China, Beijing 100872, China
Abstract  

[Objective] This paper proposes a new method for extracting key-phrases from Chinese scholarly articles, aiming to provide phrase-level concept representations for academic text mining. [Methods] First, we introduce the concepts of cohesion and freedom to measure the internal tightness of a phrase and the free collocation ability of its boundary words, and use them to compute the authority of bi-word phrases. Then, we merge the resulting list with the phrases extracted by a position-weighted method. Finally, the top N elements are returned as the final key-phrases. [Results] We evaluated the proposed PhraseRank method on Chinese academic papers and found that its precision, recall and R-MAP values were significantly higher than those of the traditional WordRank algorithm; the R-MAP value increased by more than 128%. [Limitations] The method cannot identify key-phrases of three or more words. [Conclusions] The key-phrases extracted by PhraseRank are more consistent with manually labeled results than keywords are, and effectively describe the characteristics of Chinese scholarly papers.
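
The abstract names cohesion and freedom but does not give their formulas. The sketch below is one common way to compute such scores for adjacent word pairs, assuming pointwise mutual information for cohesion and the smaller of the left and right neighbor entropies for freedom. Both choices, the function names, and the toy sentences are assumptions for illustration and may differ from the definitions used in the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbor words."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())


def bigram_scores(sentences):
    """Return {(w1, w2): (cohesion, freedom)} for adjacent word pairs.

    Assumption: cohesion = PMI of the pair, freedom = min(left entropy,
    right entropy). The paper may define both quantities differently.
    """
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    for words in sentences:
        unigrams.update(words)
        # Pad so that every pair has a left and a right neighbor.
        padded = ["<s>"] + list(words) + ["</s>"]
        for i in range(1, len(padded) - 2):
            pair = (padded[i], padded[i + 1])
            bigrams[pair] += 1
            left_ctx[pair][padded[i - 1]] += 1
            right_ctx[pair][padded[i + 2]] += 1

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_pair = count / total_bi
        p_a, p_b = unigrams[a] / total_uni, unigrams[b] / total_uni
        cohesion = math.log(p_pair / (p_a * p_b))
        freedom = min(entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
        scores[(a, b)] = (cohesion, freedom)
    return scores


# Toy, pre-segmented sentences (hypothetical data).
sentences = [
    ["电子", "文件", "管理"],
    ["大", "数据", "电子", "文件", "应用"],
    ["电子", "文件", "归档"],
]
for pair, (cohesion, freedom) in bigram_scores(sentences).items():
    print(pair, round(cohesion, 3), round(freedom, 3))
```

A pair such as ("电子", "文件") appears with varied neighbors on both sides, so its freedom is positive, while a pair seen in a single context gets a freedom of zero and would be a weaker phrase candidate under these assumptions.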

Key words: Key-phrase Extraction; Academic Text Mining; TextRank; Word Graph
Received: 01 February 2020      Published: 25 July 2020
ZTFLH:  G353  
Corresponding Authors: Xia Tian     E-mail: xiat@ruc.edu.cn

Cite this article:

Xia Tian. Extracting Key-phrases from Chinese Scholarly Papers. Data Analysis and Knowledge Discovery, 2020, 4(7): 76-86.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0071     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I7/76

Implementation Process of Keyphrase Extraction
Word Graph Snippet from Abstract Text
Demonstration of Bi-word Phrase Recognition
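Number of component words | Average length (characters) | Occurrences | Share | Cumulative share
1 | 3.34 | 20,303 | 28.07% | 28.07%
2 | 4.33 | 39,028 | 53.95% | 82.02%
3 | 5.95 | 10,005 | 13.83% | 95.85%
4 | 7.46 | 2,142 | 2.96% | 98.81%
5 | 9.48 | 476 | 0.66% | 99.47%
6 | 10.55 | 218 | 0.30% | 99.77%
7 | 12.65 | 79 | 0.11% | 99.88%
8 | 15.59 | 37 | 0.05% | 99.93%
9 | 17.07 | 14 | 0.02% | 99.95%
10 | 16.18 | 22 | 0.03% | 99.98%
Other | - | 13 | 0.02% | 100.00%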
Key Phrase Statistics in the Dataset
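
Because the method is limited to phrases of at most two words, the statistics above bound how many labeled key-phrases it could match exactly. The short check below recomputes that share from the occurrence counts in the table. Reading it as an upper bound on recall is an inference from the stated limitation, not a figure reported by the paper.

```python
# Occurrence counts by number of component words, copied from the table above.
counts = {1: 20303, 2: 39028, 3: 10005, 4: 2142, 5: 476,
          6: 218, 7: 79, 8: 37, 9: 14, 10: 22, "other": 13}

total = sum(counts.values())
# Key-phrases made of one or two words are the only ones a bi-word method can hit.
covered = counts[1] + counts[2]
share_covered = covered / total

print(total)                       # 72337
print(f"{share_covered:.2%}")      # 82.02%
print(f"{1 - share_covered:.2%}")  # 17.98%
```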
P, R and F Comparison of PhraseRank and WordRank
MAP@N of Two Algorithms
Algorithm | MAP@3 | MAP@5 | MAP@7 | MAP@10 | R-MAP
WordRank | 0.070 | 0.083 | 0.087 | 0.091 | 0.077
PhraseRank | 0.164 | 0.188 | 0.201 | 0.211 | 0.176
MAP@N and R-MAP Comparison of WordRank and PhraseRank
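
The page does not spell out how MAP@N and R-MAP are computed. The sketch below assumes a common definition: average precision over the top N ranked items, normalized by min(N, number of labeled key-phrases), with R-AP being the same measure at N equal to the number of labeled key-phrases. MAP and R-MAP would then be the means of these values over all documents. The function names are hypothetical. The last helper recomputes the relative R-MAP gain from the two values in the table.

```python
def average_precision_at(ranked, gold, n):
    """AP over the top-n ranked items (assumed definition, exact match)."""
    gold = set(gold)
    if not gold or n <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for k, item in enumerate(ranked[:n], start=1):
        if item in gold:
            hits += 1
            score += hits / k  # precision at rank k, counted only at hits
    return score / min(n, len(gold))


def r_average_precision(ranked, gold):
    """AP with the cutoff set to the number of labeled key-phrases."""
    return average_precision_at(ranked, gold, len(set(gold)))


def relative_gain(new, old):
    """Relative improvement of `new` over `old`."""
    return (new - old) / old


# R-MAP values taken from the table: WordRank 0.077, PhraseRank 0.176.
print(f"{relative_gain(0.176, 0.077):.1%}")  # 128.6%
```

The result agrees with the abstract's statement that R-MAP increased by more than 128%.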
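Document title | Manual labels | WordRank | PhraseRank
面向安全教育的儿童阅读推广研究 (Research on Children's Reading Promotion Oriented to Safety Education) | 图书馆 (library), 儿童阅读推广 (children's reading promotion), 安全 (safety) | 儿童 (children), 推广 (promotion), 阅读 (reading) | 儿童 (children), 阅读推广 (reading promotion), 教育 (education)
图书馆电子书馆配研究 (Research on E-book Library Supply for Libraries) | 馆配市场 (library supply market), 电子书馆配 (e-book library supply), 图书馆 (library) | 电子书 (e-book), 图书馆 (library), 文献 (literature) | 图书馆电子书 (library e-book), 文献 (literature), 市场 (market)
国外基于情感角度的信息搜寻行为研究进展 (Research Progress Abroad on Information Seeking Behavior from an Affective Perspective) | 情感 (emotion), 认知 (cognition), 信息搜寻行为 (information seeking behavior) | 情感 (emotion), 信息 (information), 搜寻 (seeking) | 情感因素 (emotional factors), 影响信息 (influencing information), 搜寻 (seeking)
试析大数据在电子文件管理中的应用 (On the Application of Big Data in Electronic Records Management) | 大数据 (big data), 电子文件管理 (electronic records management) | 文件 (records), 电子 (electronic) | 电子文件 (electronic records), 文件管理 (records management)
虚实融合的图书馆空间互动服务模式研究 (Research on the Interactive Service Model of Library Space Integrating the Virtual and the Physical) | 图书馆 (library), 实体空间 (physical space), 虚拟空间 (virtual space) | 图书馆 (library), 空间 (space), 服务 (service) | 图书馆空间 (library space), 服务模式 (service model), 互动服务 (interactive service)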
Extract Samples from Documents (R-AP=0)
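
The samples above can score zero even when the extracted phrases overlap heavily with the manual labels. The sketch below assumes evaluation by exact string match, which the page does not state explicitly, and applies it to the fourth sample, comparing the PhraseRank output with the manual labels. The function name is hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two key-phrase lists."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Fourth sample from the table: manual labels vs. PhraseRank output.
gold = ["大数据", "电子文件管理"]
phrase_rank = ["电子文件", "文件管理"]

print(precision_recall_f1(phrase_rank, gold))  # (0.0, 0.0, 0.0)
```

Both extracted phrases are fragments of the three-word label 电子文件管理, yet neither counts as a match under exact comparison, which is consistent with the stated limitation on phrases of three or more words.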