Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (3): 101-108     https://doi.org/10.11925/infotech.2096-3467.2019.1306
Research Paper
Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model
Cheng Bin1, Shi Shuicai1,2, Du Yuncheng1,2, Xiao Shibin1,2
1Computer School, Beijing Information Science & Technology University, Beijing 100185, China
2Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract

[Objective] This study exploits the strength of the CRF model in sequence labeling and integrates part-of-speech information and a CRF layer into a BiLSTM network to extract journal keywords automatically. [Methods] Keyword extraction is treated as a sequence labeling problem. The journal text is first segmented into words and tagged with parts of speech; the pre-processed text is then vectorized with Word2Vec word embeddings to obtain vector representations of the words; finally, a BiLSTM-CRF model extracts the keywords. [Results] In experiments on journal text collected from CNKI (China National Knowledge Infrastructure), the BiLSTM-CRF network with part-of-speech features improved accuracy by 3% on simple keywords and by 12% on complex keywords compared with the original BiLSTM model. [Limitations] The model still cannot extract complex keywords accurately, and its performance on complex keywords needs further improvement. [Conclusions] Compared with traditional methods, the BiLSTM-CRF model with part-of-speech features achieves higher recognition accuracy and is an effective keyword extraction method.

Key words: Extraction; Conditional Random Field; Deep Learning; Bidirectional Long Short-Term Memory
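The abstract frames keyword extraction as sequence labeling over segmented, part-of-speech-tagged text. The sketch below illustrates that framing only: it assumes a B/I/O tag scheme (the page does not state which scheme the authors used), and the English tokens, the part-of-speech tags, and the helper names `to_bio` and `from_bio` are hypothetical.

```python
from typing import List

def to_bio(tokens: List[str], keywords: List[List[str]]) -> List[str]:
    """Label each token B (keyword start), I (inside a keyword) or O (outside).

    Assumption: a B/I/O scheme; the source does not state its tag set.
    """
    tags = ["O"] * len(tokens)
    for kw in keywords:
        n = len(kw)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            # Tag a match only if it does not overlap an earlier match.
            if tokens[i:i + n] == kw and all(t == "O" for t in tags[i:i + n]):
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return tags

def from_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Recover keyword strings from a predicted B/I/O tag sequence."""
    keywords, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                keywords.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # An O tag (or a stray I with no open keyword) closes the keyword.
            if current:
                keywords.append(" ".join(current))
            current = []
    if current:
        keywords.append(" ".join(current))
    return keywords

# Hypothetical pre-processed sentence: each word paired with a part-of-speech tag.
tokens = ["keyword", "extraction", "with", "conditional", "random", "field", "models"]
pos = ["n", "n", "p", "a", "a", "n", "n"]  # illustrative tag set, not the authors'
gold = [["keyword", "extraction"], ["conditional", "random", "field"]]

labels = to_bio(tokens, gold)
features = list(zip(tokens, pos))  # (word, POS) pairs are the model's input side
print(labels)
print(from_bio(tokens, labels))
```

A single-token keyword receives one B tag, while a multi-token keyword is a B followed by I tags, which is one way to see why multi-word (complex) keywords are harder: every tag in the span must be predicted correctly.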
Received: 2019-12-06      Published: 2020-11-11
ZTFLH:  TP393  
Corresponding author: Cheng Bin     E-mail: 1842729609@qq.com
Cite this article:
Cheng Bin, Shi Shuicai, Du Yuncheng, Xiao Shibin. Keyword Extraction for Journals Based on Part-of-Speech and BiLSTM-CRF Combined Model[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 101-108.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1306      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I3/101
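Fig.1  Keyword extraction model combining part-of-speech and BiLSTM-CRF
Fig.2  Skip-gram model
Fig.3  Neuron structure of LSTM
Fig.4  Example of an LSTM network structure
Fig.5  Example of a BiLSTM network structure
Fig.6  Example of a BiLSTM-CRF model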
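In a BiLSTM-CRF model of the kind Fig.6 depicts, the CRF layer is generally decoded with the Viterbi algorithm: it picks the tag sequence with the highest total of per-token (emission) scores and tag-to-tag (transition) scores. The sketch below is a generic pure-Python version, not the authors' implementation; the B/I/O tag set, all score values, and the function name `viterbi` are assumptions, and start/end transitions are omitted for brevity.

```python
from typing import Dict, List

def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the tag sequence maximizing emission + transition scores."""
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag]: best previous tag for `tag` at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["B", "I", "O"]
# Hypothetical transition scores: only O -> I is penalized, since an
# "inside" tag cannot directly follow an "outside" tag.
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I"] = -100.0
# Hypothetical emission scores standing in for BiLSTM outputs on three tokens.
emissions = [
    {"B": 0.2, "I": 0.1, "O": 1.0},
    {"B": 0.9, "I": 1.0, "O": 0.1},
    {"B": 0.1, "I": 1.2, "O": 0.3},
]

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
print(greedy)                                 # per-token argmax, ignores transitions
print(viterbi(emissions, transitions, tags))  # sequence-level decoding
```

With these made-up scores, per-token argmax yields the invalid sequence O, I, I, whereas Viterbi decoding yields O, B, I. That is the usual motivation for placing a CRF layer on top of BiLSTM outputs.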
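Parameter                Value   Parameter                 Value
Word vector dimension    200     Hidden layer              128
POS vector dimension     10      Number of BiLSTM layers   2
Batch_size               100     Dropout                   0.80
Learning rate            0.001   Activation function       Tanh
Table 1  Model parameter settings (POS: part-of-speech)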
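The settings in Table 1 can be collected in a small configuration object, which also makes the implied layer sizes explicit. This is only a sketch: the class and field names are hypothetical, it assumes the part-of-speech vector is concatenated to the word vector, and it assumes the hidden layer value is the number of units per direction. The page does not say whether the Dropout value is a keep probability or a drop rate, so it is stored as given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters from Table 1; field names are illustrative."""
    word_dim: int = 200           # word vector dimension
    pos_dim: int = 10             # part-of-speech vector dimension
    hidden_size: int = 128        # hidden layer (assumed: units per direction)
    num_layers: int = 2           # number of BiLSTM layers
    batch_size: int = 100
    dropout: float = 0.80         # keep probability vs. drop rate is not stated
    learning_rate: float = 0.001
    activation: str = "tanh"

    @property
    def input_dim(self) -> int:
        # Assumption: word and POS vectors are fused by concatenation.
        return self.word_dim + self.pos_dim

    @property
    def bilstm_output_dim(self) -> int:
        # Forward and backward hidden states are concatenated per token.
        return 2 * self.hidden_size

config = ModelConfig()
print(config.input_dim, config.bilstm_output_dim)
```

Under those assumptions each token enters the network as a 210-dimensional vector and leaves the BiLSTM as a 256-dimensional one before the CRF layer scores the tags.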
Case                  Case1   Case2   Case3       Case4             Case5
Method                LSTM    BiLSTM  BiLSTM-CRF  POS-fused BiLSTM  POS-fused BiLSTM-CRF
SW  Precision P (%)   83.72   84.23   84.65       84.52             86.57
SW  Recall R (%)      79.33   81.28   83.74       82.37             85.16
SW  F-value (%)       81.50   82.73   84.19       83.43             85.86
CW  Precision P (%)   42.35   47.64   53.26       51.37             61.43
CW  Recall R (%)      36.76   41.35   47.28       43.64             52.83
CW  F-value (%)       39.36   44.27   50.09       47.19             56.81
Table 2  Experimental results of different model combinations (SW: simple keywords; CW: complex keywords; POS: part-of-speech)
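The F-values in Table 2 can be related to the precision and recall columns with the standard harmonic-mean formula F = 2PR / (P + R). The page does not state which formula the authors used, so treating the F-value as F1 is an assumption; the function name `f_value` is hypothetical. The check below uses the Case5 row.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1), both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall for Case5 (POS-fused BiLSTM-CRF) as reported in Table 2.
case5 = {"SW": (86.57, 85.16), "CW": (61.43, 52.83)}

for kind, (p, r) in case5.items():
    print(kind, round(f_value(p, r), 2))
# SW 85.86
# CW 56.81
```

Both results agree with the F-values reported for Case5, which supports reading the F-value column as F1 for that row.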
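Metric                TextRank  SGRank  SingleRank  POS-fused BiLSTM-CRF
SW  Precision P (%)   45.67     53.63   48.64       86.57
SW  Recall R (%)      43.12     52.14   47.33       85.16
SW  F-value (%)       44.36     52.87   47.98       85.86
CW  Precision P (%)   19.87     23.18   19.77       61.43
CW  Recall R (%)      16.24     20.62   17.32       52.83
CW  F-value (%)       17.87     21.83   18.46       56.81
Table 3  Experimental results of different keyword extraction methods (SW: simple keywords; CW: complex keywords; POS: part-of-speech)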