Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 101-108    DOI: 10.11925/infotech.2096-3467.2019.1306
Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model
Cheng Bin1, Shi Shuicai1,2, Du Yuncheng1,2, Xiao Shibin1,2
1Computer School, Beijing Information Science & Technology University, Beijing 100185, China
2Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
[Objective] This paper incorporates part-of-speech information and a CRF layer into a BiLSTM network, exploiting the strength of CRF models in sequence labeling, to extract keywords from journal articles automatically. [Methods] Keyword extraction is treated as a sequence labeling problem. The journal text is first preprocessed with word segmentation and part-of-speech tagging; the preprocessed text is then vectorized with a Word2Vec model to obtain word embeddings; finally, a BiLSTM-CRF model extracts the keywords. [Results] In experiments on text collected from China National Knowledge Infrastructure (CNKI), the BiLSTM-CRF network with part-of-speech information improves accuracy by 3% on Simple Words and by 12% on Complex Words compared with the original BiLSTM model. [Limitations] The model still cannot extract complex keywords accurately; future work needs to further improve its performance on complex keywords. [Conclusions] Compared with traditional methods, the BiLSTM-CRF model with part-of-speech integration achieves higher recognition accuracy and is an effective keyword extraction method.
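To make the sequence labeling formulation concrete, the following is a minimal sketch of how segmented text and its gold keywords could be turned into per-token labels and back. The B/I/O tag scheme, the function names `bio_encode` and `bio_decode`, and the English example tokens are assumptions for illustration; the abstract does not state which tag set the authors used.

```python
from typing import List, Sequence


def bio_encode(tokens: Sequence[str], keywords: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (keyword start), I (keyword continuation), or O (outside)."""
    tags = ["O"] * len(tokens)
    # Match longer keywords first so a multi-token keyword is not split up
    # by a shorter keyword that overlaps it.
    for kw in sorted(keywords, key=len, reverse=True):
        n = len(kw)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(t == "O" for t in tags[i:i + n])
            if window_free and list(tokens[i:i + n]) == list(kw):
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return tags


def bio_decode(tokens: Sequence[str], tags: Sequence[str]) -> List[List[str]]:
    """Recover keyword token spans from a predicted B/I/O tag sequence."""
    keywords: List[List[str]] = []
    current: List[str] = []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                keywords.append(current)
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            # "O", or a stray "I" with no open span, closes any open keyword.
            if current:
                keywords.append(current)
            current = []
    if current:
        keywords.append(current)
    return keywords


# Hypothetical segmented sentence with one multi-token and one single-token keyword.
tokens = ["keyword", "extraction", "uses", "a", "neural", "network"]
gold = [["keyword", "extraction"], ["network"]]
tags = bio_encode(tokens, gold)
print(tags)
print(bio_decode(tokens, tags))
```

Under this framing, the BiLSTM-CRF model predicts one tag per token, and the decoding step turns the predicted tag sequence back into keyword candidates.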

Key words: Extraction      Conditional Random Field      Deep Learning      Bidirectional Long Short Term Memory
Received: 06 December 2019      Published: 11 November 2020
ZTFLH:  TP393  
Cheng Bin, Shi Shuicai, Du Yuncheng, Xiao Shibin. Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model. Data Analysis and Knowledge Discovery, 2021, 5(3): 101-108.

Keyword Extraction for Journals Based on Part-of-speech and BiLSTM-CRF Combined Model
Skip-gram Model
LSTM Neuron Structure
LSTM Network Structure
BiLSTM Network Structure
Parameter                          Value    Parameter              Value
Word vector dimension              200      Hidden layer           128
Part-of-speech vector dimension    10       BiLSTM layers          2
Batch_size                         100      Dropout                0.80
Learning rate                      0.001    Activation function    Tanh
Model Parameter Settings
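The settings in the table can be collected into a small configuration object, as in the sketch below. The class name `ModelConfig` and its field names are hypothetical. The derived `input_dim` assumes that each token's word vector and part-of-speech vector are concatenated before entering the BiLSTM, which the table does not state explicitly. Whether the dropout value of 0.80 is a keep probability or a drop probability is also not stated, so the value is stored as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters copied from the parameter table; field names are illustrative."""
    word_dim: int = 200            # word vector dimension
    pos_dim: int = 10              # part-of-speech vector dimension
    hidden_size: int = 128         # hidden layer size
    num_layers: int = 2            # number of BiLSTM layers
    batch_size: int = 100
    dropout: float = 0.80          # stored as reported; keep vs. drop rate is not stated
    learning_rate: float = 0.001
    activation: str = "tanh"

    @property
    def input_dim(self) -> int:
        # Assumption: word and POS vectors are concatenated per token.
        return self.word_dim + self.pos_dim

    @property
    def bilstm_output_dim(self) -> int:
        # A bidirectional layer concatenates forward and backward hidden states.
        return 2 * self.hidden_size


config = ModelConfig()
print(config.input_dim, config.bilstm_output_dim)
```

Under the concatenation assumption, each token enters the network as a 210-dimensional vector and leaves the BiLSTM as a 256-dimensional vector.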
Case                      Case1    Case2    Case3    Case4    Case5
SW    Precision P (%)     83.72    84.23    84.65    84.52    86.57
      Recall R (%)        79.33    81.28    83.74    82.37    85.16
      F-value (%)         81.50    82.73    84.19    83.43    85.86
CW    Precision P (%)     42.35    47.64    53.26    51.37    61.43
      Recall R (%)        36.76    41.35    47.28    43.64    52.83
      F-value (%)         39.36    44.27    50.09    47.19    56.81
Experimental Results of Different Model Combinations
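The F-value rows are consistent with the balanced F-measure, that is, the harmonic mean of precision P and recall R. The sketch below assumes that definition, since the table does not state the formula, and checks it against the Case5 column. The function name `f_value` is illustrative.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Case5 precision and recall taken from the table, in percent.
sw_case5 = round(f_value(86.57, 85.16), 2)
cw_case5 = round(f_value(61.43, 52.83), 2)
print(sw_case5, cw_case5)
```

With these inputs the function gives 85.86 for SW and 56.81 for CW, matching the reported Case5 F-values.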
Method / Metric           TextRank    SGRank    SingleRank    BiLSTM-CRF with part-of-speech
SW    Precision P (%)     45.67       53.63     48.64         86.57
      Recall R (%)        43.12       52.14     47.33         85.16
      F-value (%)         44.36       52.87     47.98         85.86
CW    Precision P (%)     19.87       23.18     19.77         61.43
      Recall R (%)        16.24       20.62     17.32         52.83
      F-value (%)         17.87       21.83     18.46         56.81
Experimental Results of Different Keyword Extraction Methods
