Extracting Keywords from Government Work Reports with Multi-feature Fusion
Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
[Objective] This paper proposes a modified BiLSTM-CRF model that automatically extracts keywords from government work reports using BERT word vectors, Wubi features, domain synonyms, and word frequencies. [Methods] First, we used BERT and Wubi vectors to capture the semantic and glyph (character-shape) features of the input sequence. Second, we captured the category features of the input sequence with a domain synonym table for government work reports. Third, we applied word frequency features as weights to the word vectors to capture the context features of the input sequence. Finally, we used the BiLSTM-CRF model to capture further semantic information and automatically extract keywords from government work reports. [Results] We evaluated the proposed model on a self-built corpus of government work reports. The precision, recall, and F1 values reached 86.14%, 91.56%, and 88.42%, respectively. We also assessed the contribution of each feature to the model with ablation experiments. [Limitations] More research is needed to apply the model to other types of text. [Conclusions] The proposed method can effectively extract keywords from Chinese texts.
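The feature-fusion step summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the features are combined, so concatenation is assumed here, and relative frequency within the input sequence is assumed as the word frequency weight. The names `frequency_weights`, `one_hot`, `fuse_features`, and `category_of` are hypothetical, and all vectors are toy placeholders standing in for real BERT and Wubi embeddings. In the described model, the resulting per-token vectors would then be passed to the BiLSTM-CRF tagger.

```python
from collections import Counter
from typing import Dict, List


def frequency_weights(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each token within the input sequence (assumed weighting)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


def one_hot(index: int, size: int) -> List[float]:
    """One-hot vector of length `size` with 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def fuse_features(
    tokens: List[str],
    semantic: Dict[str, List[float]],
    glyph: Dict[str, List[float]],
    category_of: Dict[str, int],
    num_categories: int,
) -> List[List[float]]:
    """Build one fused vector per token.

    Assumed layout: [frequency-weighted semantic | glyph | category one-hot].
    Category 0 is reserved for tokens missing from the synonym table.
    """
    weights = frequency_weights(tokens)
    fused = []
    for token in tokens:
        # Scale the semantic vector by the token's frequency weight.
        weighted = [weights[token] * x for x in semantic[token]]
        category = one_hot(category_of.get(token, 0), num_categories)
        # Concatenation is an assumption; the source does not specify the operator.
        fused.append(weighted + glyph[token] + category)
    return fused


# Toy inputs: every number is a placeholder, not real BERT or Wubi output.
tokens = ["经济", "发展", "经济", "改革"]
semantic = {"经济": [2.0, 4.0], "发展": [4.0, 8.0], "改革": [8.0, 4.0]}
glyph = {"经济": [0.1, 0.2], "发展": [0.3, 0.4], "改革": [0.5, 0.6]}
category_of = {"经济": 1, "改革": 2}  # hypothetical synonym-table categories

fused = fuse_features(tokens, semantic, glyph, category_of, num_categories=3)
```

Each entry of `fused` has length 7 in this toy setup (2 semantic, 2 glyph, 3 category dimensions). Repeated tokens receive identical vectors here because the placeholder lookups are static; contextual embeddings such as BERT would instead differ by position.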
Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang. Extracting Keywords from Government Work Reports with Multi-feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(5): 54-63.