Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 54-63    DOI: 10.11925/infotech.2096-3467.2021.0700
Extracting Keywords from Government Work Reports with Multi-feature Fusion
Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper proposes a modified BiLSTM-CRF model that automatically extracts keywords from government work reports with the help of BERT word vectors, Wubi features, domain synonyms, and word frequencies. [Methods] First, we used BERT and Wubi vectors to capture the semantic and glyph (character-shape) features of the input sequence. Second, we captured the category features of the input sequence with a domain synonym table for government work reports. Third, we applied the word frequency features as weights on the word vectors to capture the context features of the input sequence. Finally, we used the BiLSTM-CRF model to capture further semantic information and automatically extract keywords from government work reports. [Results] We evaluated the proposed model on a self-built corpus of government work reports. Precision, recall, and F1 reached 86.14%, 91.56%, and 88.42%, respectively. We also evaluated the contribution of each feature with an ablation experiment. [Limitations] More research is needed to apply the model to other types of text. [Conclusions] The proposed method can effectively extract keywords from Chinese texts.
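
The fusion steps in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the word frequency weight is the token count divided by the largest count in the passage, that the weight scales only the semantic vector, and that the synonym-table category is encoded as a one-hot vector. The function names, the toy vectors, and the category id are all hypothetical stand-ins for real BERT vectors, Wubi embeddings, and the domain synonym table.

```python
from collections import Counter
from typing import Dict, List


def frequency_weights(tokens: List[str]) -> Dict[str, float]:
    """Map each token to count / max count (an assumed normalisation)."""
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {tok: n / top for tok, n in counts.items()}


def fuse_features(
    semantic_vec: List[float],
    glyph_vec: List[float],
    category_id: int,
    num_categories: int,
    freq_weight: float,
) -> List[float]:
    """Build one fused input vector for a single token."""
    # One-hot category feature; stays all zeros if the token has no category.
    category_vec = [0.0] * num_categories
    if 0 <= category_id < num_categories:
        category_vec[category_id] = 1.0
    # Assumption: the frequency weight scales the semantic vector only.
    weighted = [freq_weight * x for x in semantic_vec]
    # Concatenate semantic, glyph, and category features.
    return weighted + glyph_vec + category_vec


tokens = ["就业", "补贴", "就业", "政策"]
weights = frequency_weights(tokens)
fused = fuse_features(
    semantic_vec=[2.0, 4.0],  # toy stand-in for a BERT vector
    glyph_vec=[1.0],          # toy stand-in for a Wubi embedding
    category_id=1,            # hypothetical synonym-table category
    num_categories=3,
    freq_weight=weights["补贴"],
)
print(fused)
```

In a full model, one such fused vector per token would form the input sequence to the BiLSTM-CRF layer; concatenation is only one plausible way to combine the features.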

Key words: Extraction; Government Work Report; BERT; Wubi; Word Frequency
Received: 13 July 2021      Published: 01 March 2022
CLC Number: TP393
Fund: National Natural Science Foundation of China (62171043); Key Program of the National Language Commission (ZDI145-10)
Corresponding Author: Lv Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: Lvxueqiang@aliyun.com

Cite this article:

Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang. Extracting Keywords from Government Work Reports with Multi-feature Fusion. Data Analysis and Knowledge Discovery, 2022, 6(5): 54-63.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0700     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I5/54

Figure: Architecture of Keyword Extraction Model Based on Multi-feature Fusion
Environment        Version
Operating system   Linux
CPU                Intel(R) Xeon(R) Gold 5118 CPU @2.30GHz
GPU                Tesla P4
Python             3.6.9
PyTorch            1.6.0
Table: Experimental Environment
No.  Model                                          P/%    R/%    F1/%
1    BertVecRank                                    42.60  42.60  42.60
2    Word2Vec-BiLSTM-CRF                            67.81  59.33  63.29
3    BERT-BiLSTM-CRF                                83.30  86.88  84.64
4    Multi-feature fusion method (proposed)         86.14  91.56  88.42
Table: Experimental Results of Keyword Extraction in Government Work Report
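Passage 1
  Text: 农村综合改革稳步推进,农业税全部取消。企业改革继续深化,60%的省管企业完成了主辅分离,74家上市公司全部完成或进入股权分置改革程序。民营经济不断发展壮大,非公有制经济占全省生产总值的比重达到52%。
  Translation: Comprehensive rural reform advanced steadily, and agricultural taxes were abolished entirely. Enterprise reform continued to deepen: 60% of provincially administered enterprises completed the separation of core and auxiliary businesses, and all 74 listed companies completed or entered the split-share structure reform process. The private economy kept growing, and the non-public economy reached 52% of the province's GDP.
  Annotated keywords:    企业 (enterprise), 经济 (economy), 改革 (reform)
  BertVecRank:           企业 (enterprise), 改革 (reform), 发展 (development)
  Word2Vec-BiLSTM-CRF:   企业 (enterprise), 改革 (reform), 发展 (development)
  BERT-BiLSTM-CRF:       企业 (enterprise), 民营 (private sector), 改革 (reform)
  Proposed method:       企业 (enterprise), 经济 (economy), 改革 (reform)

Passage 2
  Text: 深入开展“执政为民、服务发展”学习整改活动,强化“五种观念”,解决“五大问题”。政府各部门针对思想禁锢、程序繁琐、效率不高和官僚习气等突出问题,认真查找整改,服务意识增强,工作作风得到改进。深化行政管理体制改革,下放了一批行政管理权限,精简了一批行政审批事项,减少了一批行政事业性收费项目,发展环境进一步改善。
  Translation: The "governing for the people, serving development" study and rectification campaign was carried out in depth, strengthening the "five concepts" and resolving the "five major problems". Government departments targeted prominent problems such as rigid thinking, cumbersome procedures, low efficiency, and bureaucratic habits, and earnestly identified and rectified them; service awareness increased and work style improved. Reform of the administrative system was deepened: a number of administrative powers were delegated, a number of administrative approval items were streamlined, and a number of administrative and institutional fee items were cut, further improving the development environment.
  Annotated keywords:    行政 (administration), 服务 (service), 整改 (rectification)
  BertVecRank:           行政 (administration), 发展 (development), 管理 (management)
  Word2Vec-BiLSTM-CRF:   行政 (administration), 服务 (service), 问题 (problem)
  BERT-BiLSTM-CRF:       行政 (administration), 服务 (service), 整改 (rectification)
  Proposed method:       行政 (administration), 服务 (service), 整改 (rectification)

Passage 3
  Text: (七)努力做好就业再就业和社会保障工作,切实解决人民等措施,鼓励和支持劳动者自主创业。落实社会保险补贴和岗位补贴政策,鼓励企业吸纳更多下岗失业人员再就业。落实就业培训补贴政策,加强城乡劳动力职业技能培训。城市低保政策与就业政策联动,指导和帮助16.3万下岗失业人员实现就业再就业。
  Translation: (7) Work hard on employment, re-employment, and social security, earnestly resolve the people [...] and other measures, and encourage and support workers in starting their own businesses. Implement social insurance subsidy and job subsidy policies, and encourage enterprises to re-employ more laid-off and unemployed workers. Implement the employment training subsidy policy and strengthen vocational skills training for the urban and rural workforce. Link the urban subsistence allowance policy with employment policy, guiding and helping 163,000 laid-off and unemployed workers find employment or re-employment.
  Annotated keywords:    就业 (employment), 补贴 (subsidy), 创业 (entrepreneurship)
  BertVecRank:           就业 (employment), 补贴 (subsidy), 再就业 (re-employment)
  Word2Vec-BiLSTM-CRF:   就业 (employment), 补贴 (subsidy), 再就业 (re-employment)
  BERT-BiLSTM-CRF:       就业 (employment), 政策 (policy), 创业 (entrepreneurship)
  Proposed method:       就业 (employment), 补贴 (subsidy), 创业 (entrepreneurship)
Table: Example Description of Keyword Extraction Model Results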
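
To make the comparison in the example table concrete, the following sketch scores each model's output for Passage 1 against the annotated keywords. It assumes exact-match scoring over keyword sets; the evaluation protocol actually used in the paper (matching rule, micro or macro averaging) is not stated on this page, so the function `prf` is an illustrative assumption rather than the authors' metric code.

```python
from typing import Iterable, Tuple


def prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two keyword sets."""
    predicted_set, gold_set = set(predicted), set(gold)
    hits = len(predicted_set & gold_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Keywords for Passage 1, copied from the example table.
gold = ["企业", "经济", "改革"]
outputs = {
    "BertVecRank": ["企业", "改革", "发展"],
    "Word2Vec-BiLSTM-CRF": ["企业", "改革", "发展"],
    "BERT-BiLSTM-CRF": ["企业", "民营", "改革"],
    "Proposed method": ["企业", "经济", "改革"],
}

scores = {model: prf(keywords, gold) for model, keywords in outputs.items()}
for model, (p, r, f1) in scores.items():
    print(f"{model}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under this scoring, the three baseline models each recover two of the three annotated keywords for Passage 1, while the proposed method recovers all three. Corpus-level figures depend on how per-passage scores are averaged, so an averaged F1 need not equal the harmonic mean of the averaged precision and recall.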
No.  Feature combination                            P/%    R/%    F1/%
A    BERT-Wubi-BiLSTM-CRF                           84.35  89.66  86.57
B    BERT-Count-BiLSTM-CRF                          85.37  90.05  87.28
C    BERT-Wubi-Count-BiLSTM-CRF                     86.03  90.61  87.89
D    Multi-feature fusion method (proposed)         86.14  91.56  88.42
Table: Experimental Results of Each Feature Combination
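
One way to read the ablation table is as F1 gains over a feature-free baseline. The sketch below assumes that the BERT-BiLSTM-CRF row of the main results table (F1 of 84.64) is the appropriate baseline; the page does not say so explicitly, and the helper `f1_gains` is hypothetical.

```python
from typing import Dict

# Assumed baseline: BERT-BiLSTM-CRF from the main results table.
BASELINE_F1 = 84.64

# F1 values copied from the feature combination table.
ablation_f1 = {
    "A: BERT-Wubi-BiLSTM-CRF": 86.57,
    "B: BERT-Count-BiLSTM-CRF": 87.28,
    "C: BERT-Wubi-Count-BiLSTM-CRF": 87.89,
    "D: proposed method": 88.42,
}


def f1_gains(scores: Dict[str, float], baseline: float) -> Dict[str, float]:
    """Return each configuration's F1 gain over the baseline, in points."""
    return {name: round(f1 - baseline, 2) for name, f1 in scores.items()}


gains = f1_gains(ablation_f1, BASELINE_F1)
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} F1 points")
```

Under that assumption, the gain from combining the Wubi and Count features (row C) is smaller than the sum of their individual gains (rows A and B), which suggests the two features carry partly overlapping information.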