Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 101-108    DOI: 10.11925/infotech.2096-3467.2019.1306
Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model
Cheng Bin1, Shi Shuicai1,2, Du Yuncheng1,2, Xiao Shibin1,2
1Computer School, Beijing Information Science & Technology University, Beijing 100185, China
2Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract  

[Objective] This study automatically extracts keywords from journal articles by incorporating part-of-speech information and a CRF layer into a BiLSTM network, exploiting the strengths of the CRF model in sequence labeling. [Methods] Keyword extraction is treated as a sequence labeling problem. The journal text is first pre-processed with word segmentation and part-of-speech tagging; the pre-processed text is then vectorized with a Word2Vec model to obtain word embeddings; finally, a BiLSTM-CRF model extracts the keywords automatically. [Results] In experiments on text collected from the China National Knowledge Infrastructure (CNKI), the BiLSTM-CRF network with part-of-speech information improved accuracy by 3% on Simple Words and by 12% on Complex Words compared with the original BiLSTM model. [Limitations] The model still cannot extract complex keywords accurately, and future work needs to further improve its performance on them. [Conclusions] Compared with traditional methods, the BiLSTM-CRF model with part-of-speech integration achieves higher recognition accuracy and is an effective keyword extraction method.
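
As an illustration of treating keyword extraction as sequence labeling, the sketch below turns per-token labels back into keyword strings. The B/I/O tag scheme, the function name `decode_keywords`, and the sample tokens are assumptions made for this example; the abstract does not state which tag set the authors used, and the labels here stand in for the output of a trained BiLSTM-CRF tagger.

```python
from typing import List


def decode_keywords(tokens: List[str], tags: List[str], joiner: str = "") -> List[str]:
    """Collect keywords from segmented tokens and their predicted tags.

    Assumed tag scheme (not taken from the paper):
    "B" starts a keyword, "I" continues it, "O" is outside any keyword.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    keywords: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new keyword starts; flush the one in progress, if any.
            if current:
                keywords.append(joiner.join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the keyword that is currently open.
            current.append(token)
        else:
            # "O" (or a stray "I" with no open keyword) closes the span.
            if current:
                keywords.append(joiner.join(current))
            current = []
    if current:
        keywords.append(joiner.join(current))
    return keywords


# Hypothetical segmented sentence and tagger output.
tokens = ["基于", "条件", "随机场", "的", "关键词", "抽取"]
tags = ["O", "B", "I", "O", "B", "I"]
print(decode_keywords(tokens, tags))
```

For Chinese text the tokens are joined without a separator; for space-delimited languages, pass `joiner=" "`.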

Key words: Keyword Extraction; Conditional Random Field; Deep Learning; Bidirectional Long Short Term Memory
Received: 06 December 2019      Published: 11 November 2020
Chinese Library Classification (ZTFLH): TP393
Corresponding Authors: Cheng Bin     E-mail: 1842729609@qq.com

Cite this article:

Cheng Bin,Shi Shuicai,Du Yuncheng,Xiao Shibin. Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model. Data Analysis and Knowledge Discovery, 2021, 5(3): 101-108.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1306     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/101

[Figure: Skip-gram Model]
[Figure: LSTM Neuron Structure]
[Figure: LSTM Network Structure]
[Figure: BiLSTM Network Structure]
[Figure: BiLSTM-CRF Model]
Parameter                Value    Parameter                  Value
Word vector dimension    200      Hidden layer               128
POS vector dimension     10       Number of BiLSTM layers    2
Batch_size               100      Dropout                    0.80
Learning rate            0.001    Activation function        Tanh
Model Parameter Settings
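
The parameter settings above can be captured in a small configuration object, sketched below. The class name `ModelConfig` and its field names are hypothetical. Two points are assumptions rather than statements from the source: that the word and part-of-speech vectors are concatenated per token (giving a 210-dimensional input), and that the hidden layer value of 128 means units per LSTM direction. The table also does not say whether the dropout value 0.80 is a keep probability or a drop rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters copied from the parameter table; names are illustrative."""

    word_dim: int = 200            # word vector dimension
    pos_dim: int = 10              # part-of-speech vector dimension
    hidden_size: int = 128         # hidden layer (assumed: units per direction)
    num_layers: int = 2            # number of BiLSTM layers
    batch_size: int = 100
    dropout: float = 0.80          # keep vs. drop probability is not stated
    learning_rate: float = 0.001
    activation: str = "tanh"

    @property
    def input_dim(self) -> int:
        # Assumption: each token is represented by its word vector
        # concatenated with its part-of-speech vector.
        return self.word_dim + self.pos_dim


config = ModelConfig()
print(config.input_dim)
```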
Case                   Case1   Case2   Case3       Case4            Case5
Method                 LSTM    BiLSTM  BiLSTM-CRF  BiLSTM with POS  BiLSTM-CRF with POS
SW  Precision P (%)    83.72   84.23   84.65       84.52            86.57
    Recall R (%)       79.33   81.28   83.74       82.37            85.16
    F-value (%)        81.50   82.73   84.19       83.43            85.86
CW  Precision P (%)    42.35   47.64   53.26       51.37            61.43
    Recall R (%)       36.76   41.35   47.28       43.64            52.83
    F-value (%)        39.36   44.27   50.09       47.19            56.81
Experimental Results of Different Model Combinations
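
The page does not give the formula behind the F-value rows. Assuming it is the usual balanced harmonic mean of precision and recall, F = 2PR / (P + R), the short sketch below recomputes the Case5 entries from their P and R values. The function name `f_value` is illustrative.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (both in %)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Case5 (part-of-speech + BiLSTM-CRF) precision and recall from the table.
sw = f_value(86.57, 85.16)
cw = f_value(61.43, 52.83)
print(f"SW F-value: {sw:.2f}")  # SW F-value: 85.86
print(f"CW F-value: {cw:.2f}")  # CW F-value: 56.81
```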
Metric / Method        TextRank  SGRank  SingleRank  BiLSTM-CRF with POS
SW  Precision P (%)    45.67     53.63   48.64       86.57
    Recall R (%)       43.12     52.14   47.33       85.16
    F-value (%)        44.36     52.87   47.98       85.86
CW  Precision P (%)    19.87     23.18   19.77       61.43
    Recall R (%)       16.24     20.62   17.32       52.83
    F-value (%)        17.87     21.83   18.46       56.81
Experimental Results of Different Keyword Extraction Methods