Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 101-108    DOI: 10.11925/infotech.2096-3467.2019.1306
Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model
Cheng Bin1, Shi Shuicai1,2, Du Yuncheng1,2, Xiao Shibin1,2
1Computer School, Beijing Information Science & Technology University, Beijing 100185, China
2Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract  

[Objective] This paper automatically extracts keywords from journal articles by incorporating part-of-speech information and a CRF layer into a BiLSTM network, exploiting the strength of the CRF model on sequence labeling problems. [Methods] Keyword extraction is treated as a sequence labeling problem. The journal text is first segmented into words and tagged with parts of speech; the pre-processed text is then vectorized with a Word2Vec model to obtain word embeddings; finally, a BiLSTM-CRF model extracts the keywords automatically. [Results] In experiments on text collected from the China National Knowledge Infrastructure (CNKI), the BiLSTM-CRF network with part-of-speech information improves accuracy on Simple Word (SW) keywords by 3% over the original BiLSTM model, and on Complex Word (CW) keywords by 12%. [Limitations] The model still cannot extract complex keywords accurately; future work needs to further improve its performance on complex keywords. [Conclusions] Compared with traditional methods, the BiLSTM-CRF model with part-of-speech integration achieves higher recognition accuracy and is an effective keyword extraction method.
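
To illustrate what treating keyword extraction as sequence labeling can look like, the sketch below converts a segmented sentence and its gold keywords into per-token labels. The B/I/O tag set, the function name `bio_labels`, and the English sample tokens are assumptions made for illustration; the abstract does not state which tagging scheme the authors used.

```python
from typing import List, Sequence


def bio_labels(tokens: Sequence[str],
               keywords: Sequence[Sequence[str]]) -> List[str]:
    """Label each token B (keyword start), I (inside a keyword), or O (outside).

    Hypothetical helper: the B/I/O scheme is an assumption, not the paper's
    confirmed tag set.
    """
    labels = ["O"] * len(tokens)
    # Match longer keywords first so a multi-token keyword is not split
    # by a shorter keyword that it contains.
    for kw in sorted(keywords, key=len, reverse=True):
        n = len(kw)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(lab == "O" for lab in labels[i:i + n])
            if window_free and list(tokens[i:i + n]) == list(kw):
                labels[i] = "B"
                for j in range(i + 1, i + n):
                    labels[j] = "I"
    return labels


# Illustrative input only; real input would be segmented journal text.
tokens = ["keyword", "extraction", "with", "conditional", "random", "field"]
keywords = [["keyword", "extraction"], ["conditional", "random", "field"]]
for token, label in zip(tokens, bio_labels(tokens, keywords)):
    print(token, label)
```

With labels of this kind, a tagger is trained to predict one label per token, and keywords are read back from each B token together with the I tokens that follow it.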

Key words: Extraction      Conditional Random Field      Deep Learning      Bidirectional Long Short Term Memory
Received: 06 December 2019      Published: 11 November 2020
ZTFLH:  TP393  
Corresponding Authors: Cheng Bin     E-mail: 1842729609@qq.com

Cite this article:

Cheng Bin,Shi Shuicai,Du Yuncheng,Xiao Shibin. Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model. Data Analysis and Knowledge Discovery, 2021, 5(3): 101-108.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1306     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I3/101

[Figure] Keyword Extraction for Journals Based on Part-of-speech and BiLSTM-CRF Combined Model
[Figure] Skip-gram Model
[Figure] LSTM Neuron Structure
[Figure] LSTM Network Structure
[Figure] BiLSTM Network Structure
[Figure] BiLSTM-CRF Model
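
A CRF layer on top of a BiLSTM usually selects the best label sequence with Viterbi decoding, which combines per-token scores with label-to-label transition scores. The sketch below is a minimal pure-Python version of that standard algorithm; the label indices (0 = B, 1 = I, 2 = O) and all score values are invented for illustration and are not taken from the paper, which does not describe its decoding step on this page.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the label index sequence with the highest total score.

    emissions[t][k]   score of label k at position t (e.g. BiLSTM output)
    transitions[j][k] score of moving from label j to label k
    """
    if not emissions:
        return []
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for k in range(n_labels):
            # Best previous label for reaching label k at position t.
            best_prev = max(range(n_labels),
                            key=lambda j: score[j] + transitions[j][k])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptr)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for ptr in reversed(backptr):
        best = ptr[best]
        path.append(best)
    path.reverse()
    return path


# Made-up scores. The transition O -> I is heavily penalized because an
# "inside" label cannot follow an "outside" label.
transitions = [
    [0.0, 1.0, 0.0],    # from B
    [0.0, 0.5, 0.0],    # from I
    [0.0, -10.0, 0.0],  # from O
]
emissions = [
    [0.5, 0.0, 1.0],
    [0.2, 1.2, 0.3],
    [0.0, 0.1, 1.0],
]
print(viterbi(emissions, transitions))
```

In this toy input, picking the highest emission at each position independently would give O, I, O, which is not a valid label sequence. The transition scores steer the decoder to B, I, O instead, which is the kind of constraint a CRF layer adds over a plain BiLSTM tagger.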
Parameter                 Value    Parameter                  Value
Word vector dimension     200      Hidden layer size          128
POS vector dimension      10       Number of BiLSTM layers    2
Batch_size                100      Dropout                    0.80
Learning rate             0.001    Activation function        Tanh
Model Parameter Settings
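
The settings in the table above can be collected in a small configuration object. This is a sketch only: the class name `ModelConfig`, the field names, and the derived `input_dim` are hypothetical. In particular, concatenating the word vector and the part-of-speech vector for each token is an assumption about how the two are combined, not something this page confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Values copied from the parameter table; names are illustrative."""
    word_dim: int = 200          # word vector dimension
    pos_dim: int = 10            # part-of-speech vector dimension
    hidden_size: int = 128       # table entry "hidden layer"
    num_layers: int = 2          # number of BiLSTM layers
    batch_size: int = 100
    dropout: float = 0.80        # as listed; keep vs. drop rate not stated
    learning_rate: float = 0.001
    activation: str = "tanh"

    @property
    def input_dim(self) -> int:
        # Assumption: word and POS vectors are concatenated per token.
        return self.word_dim + self.pos_dim


config = ModelConfig()
print(config.input_dim)
```

The table does not say whether 0.80 is the probability of keeping or of dropping a unit, so the value would need to be mapped onto a specific framework's dropout argument with care.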
Case                      Case1    Case2     Case3         Case4              Case5
Method                    LSTM     BiLSTM    BiLSTM-CRF    BiLSTM with POS    BiLSTM-CRF with POS
SW   Precision P (%)      83.72    84.23     84.65         84.52              86.57
     Recall R (%)         79.33    81.28     83.74         82.37              85.16
     F-value (%)          81.50    82.73     84.19         83.43              85.86
CW   Precision P (%)      42.35    47.64     53.26         51.37              61.43
     Recall R (%)         36.76    41.35     47.28         43.64              52.83
     F-value (%)          39.36    44.27     50.09         47.19              56.81
Experimental Results of Different Model Combinations
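
The F-values in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short sketch below recomputes the Case5 figures on that assumption; the function name `f_measure` is illustrative, and the page itself does not spell out the formula.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Case5 (BiLSTM-CRF with POS), values in percent from the table above.
print(round(f_measure(86.57, 85.16), 2))  # SW row
print(round(f_measure(61.43, 52.83), 2))  # CW row
```

Both results round to the values printed in the Case5 column (85.86 for SW and 56.81 for CW).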
Method                    TextRank    SGRank    SingleRank    BiLSTM-CRF with POS
SW   Precision P (%)      45.67       53.63     48.64         86.57
     Recall R (%)         43.12       52.14     47.33         85.16
     F-value (%)          44.36       52.87     47.98         85.86
CW   Precision P (%)      19.87       23.18     19.77         61.43
     Recall R (%)         16.24       20.62     17.32         52.83
     F-value (%)          17.87       21.83     18.46         56.81
Experimental Results of Different Keyword Extraction Methods