Extracting Keywords from Government Work Reports with Multi-feature Fusion*

doi:10.11925/infotech.2096-3467.2021.0700

Data Analysis and Knowledge Discovery

2022, Vol. 6

Issue (5): 54-63 https://doi.org/10.11925/infotech.2096-3467.2021.0700

Research Paper


Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang

Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China





Abstract

[Objective] This paper fuses BERT word vectors, Wubi features, a domain synonym table, and character-frequency features into a BiLSTM-CRF model to automatically extract keywords from a corpus of government work reports. [Methods] First, we used BERT and Wubi vectors to capture the semantic and glyph features of the input sequence. Second, we incorporated a domain synonym table built for government work reports to capture the category features of the input sequence. Third, we assigned character-frequency features as weights to the word vectors to capture the context features of the input sequence. Together, these features give the BiLSTM-CRF model more semantic information for extracting keywords from government work reports. [Results] On a self-built corpus of government work reports, the proposed method reached a precision, recall, and F1 of 86.14%, 91.56%, and 88.42%, respectively. Ablation experiments evaluated the effectiveness of each feature. [Limitations] The model performs well on government work reports, but its ability to generalize to other texts needs to be improved in future work. [Conclusions] The proposed multi-feature fusion method extracts keywords more effectively than the baseline keyword extraction methods.
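The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes that each character already has a BERT vector, a Wubi vector, and a synonym-table category id, and that the character-frequency weight scales the BERT vector before concatenation. The function names, the toy vectors, and the relative-frequency weighting are all hypothetical; the abstract does not specify these details, and the BiLSTM-CRF tagger that would consume the fused vectors is not shown.

```python
from collections import Counter


def char_frequency_weights(text):
    """Relative frequency of each character in the text (assumed weighting scheme)."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}


def one_hot(index, size):
    """One-hot category vector; all zeros when the character has no category."""
    vec = [0.0] * size
    if index is not None:
        vec[index] = 1.0
    return vec


def fuse(chars, bert_vecs, wubi_vecs, category_ids, num_categories, weights):
    """Concatenate frequency-weighted BERT, Wubi, and category features per character."""
    fused = []
    for ch, bert, wubi, cat in zip(chars, bert_vecs, wubi_vecs, category_ids):
        w = weights[ch]
        weighted_bert = [w * x for x in bert]  # frequency acts as a scalar weight
        fused.append(weighted_bert + list(wubi) + one_hot(cat, num_categories))
    return fused


# Toy input: six characters, 2-dim "BERT" and "Wubi" vectors, 3 synonym categories.
text = "发展经济发展"
chars = list(text)
bert_vecs = [[3.0, 6.0] for _ in chars]   # placeholder values, not real BERT output
wubi_vecs = [[0.1, 0.2] for _ in chars]   # placeholder values, not real Wubi embeddings
category_ids = [0, 0, 1, 1, 0, 0]         # hypothetical synonym-table categories

weights = char_frequency_weights(text)
fused = fuse(chars, bert_vecs, wubi_vecs, category_ids, 3, weights)

for ch, vec in zip(chars, fused):
    print(ch, [round(x, 3) for x in vec])
```

Each fused vector here has 2 + 2 + 3 = 7 dimensions. In a real system the BERT part alone would typically have several hundred dimensions, and the weighting and concatenation order would follow the paper's own definitions.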

Keywords: Keyword Extraction, Government Work Report, BERT, Wubi, Character Frequency
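The precision, recall, and F1 values quoted in the abstract are related through the usual harmonic-mean definition of F1. The short sketch below applies that definition to the reported precision and recall. The function name is hypothetical, and the abstract does not say how the scores were averaged.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent.
reported_p, reported_r, reported_f1 = 86.14, 91.56, 88.42

pooled_f1 = f1_score(reported_p, reported_r)
print(f"F1 from reported P and R: {pooled_f1:.2f}")   # 88.77
print(f"Reported F1:              {reported_f1:.2f}")  # 88.42
```

The two values differ by about 0.35 points. One possible reading, stated here only as an assumption, is that the reported F1 was averaged over documents or folds rather than computed from the aggregate precision and recall.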

Received: 2021-07-13    Published: 2022-03-01

Chinese Library Classification (ZTFLH): TP393

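Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project (62171043) and a Key Project of the State Language Commission (ZDI145-10).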

Corresponding author: Lv Xueqiang, ORCID: 0000-0002-1422-0560, E-mail: Lvxueqiang@aliyun.com

Cite this article:

Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang. Extracting Keywords from Government Work Reports with Multi-feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(5): 54-63.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0700 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/54

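Fig. 1 Overall architecture of the keyword extraction model based on multi-feature fusion

Table 1 Experimental environment

Table 2 Comparison of keyword extraction results on government work reports

Table 3 Sample output of the keyword extraction models

Table 4 Experimental results for each feature combination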

