Extracting Keywords from Government Work Reports with Multi-feature Fusion
Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract [Objective] This paper proposes a modified BiLSTM-CRF model that automatically extracts keywords from government work reports with the help of BERT word vectors, Wubi features, domain synonyms, and word frequencies. [Methods] First, we used BERT and Wubi vectors to capture the semantic and glyph (character-shape) features of the input sequence. Second, we captured the category features of the input sequence with a domain synonym table for government work reports. Third, we applied the word frequency features as weights on the word vectors to capture the context features of the input sequence. Finally, we used the BiLSTM-CRF model to capture more semantic information and automatically extract keywords from government work reports. [Results] We evaluated the proposed model on a self-built corpus of government work reports. Its precision, recall, and F1 values reached 86.14%, 91.56%, and 88.42%, respectively. We also evaluated the contribution of each feature in the model with ablation experiments. [Limitations] More research is needed to apply the model to other types of text. [Conclusions] The proposed method can effectively extract keywords from Chinese texts.
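The feature fusion described in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, vector dimensions, or fusion order, so everything here is an assumption made for illustration: the function names `frequency_weights` and `fuse_features` are hypothetical, relative in-document frequency stands in for the word frequency feature, the synonym-table category is encoded as a one-hot vector, and short hand-written lists stand in for real BERT and Wubi embeddings.

```python
from collections import Counter


def frequency_weights(tokens):
    """Relative frequency of each token in the document (assumed weighting scheme)."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[tok] / total for tok in tokens]


def fuse_features(tokens, semantic_vecs, glyph_vecs, category_ids, num_categories):
    """Build one fused vector per token.

    semantic_vecs: stand-ins for BERT vectors (one list of floats per token)
    glyph_vecs:    stand-ins for Wubi vectors (one list of floats per token)
    category_ids:  index of each token's synonym-table category
    """
    weights = frequency_weights(tokens)
    fused = []
    for sem, glyph, cat, w in zip(semantic_vecs, glyph_vecs, category_ids, weights):
        # Scale the semantic vector by the token's frequency weight.
        weighted_sem = [w * x for x in sem]
        # Encode the domain-synonym category as a one-hot vector.
        one_hot = [0.0] * num_categories
        one_hot[cat] = 1.0
        # Concatenate: weighted semantic + glyph + category features.
        fused.append(weighted_sem + list(glyph) + one_hot)
    return fused


# Toy input: three tokens with 2-dim "semantic" and 1-dim "glyph" vectors.
tokens = ["经济", "发展", "经济"]
semantic_vecs = [[3.0, 6.0], [3.0, 9.0], [3.0, 6.0]]
glyph_vecs = [[0.5], [0.2], [0.5]]
category_ids = [0, 1, 0]

fused = fuse_features(tokens, semantic_vecs, glyph_vecs, category_ids, num_categories=2)
for tok, vec in zip(tokens, fused):
    print(tok, vec)
```

In a full system, vectors of this kind would serve as the per-token input to a BiLSTM-CRF sequence labeler, which is outside the scope of this sketch.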
Received: 13 July 2021
Published: 01 March 2022
Fund: National Natural Science Foundation of China (62171043); Key Program of the National Language Commission (ZDI145-10)
Corresponding Author: Lv Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: Lvxueqiang@aliyun.com