Extracting Long Terms from Sparse Samples<sup>*</sup>

doi:10.11925/infotech.2096-3467.2022.1231

Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue (1): 135-145. https://doi.org/10.11925/infotech.2096-3467.2022.1231

Research Article


Lyu Xueqiang (吕学强)¹, Yang Yuting (杨雨婷)¹, Xiao Gang (肖刚)², Li Yuxian (李育贤)¹, You Xindong (游新冬)¹

¹Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
²General Key Laboratory of Complex System Simulation, Institute of Systems Engineering, Academy of Military Sciences, Beijing 100101, China


Abstract

[Objective] To address sparse samples and hard-to-identify long terms in the weapons and equipment domain, this paper proposes a method that combines a head-tail pointer network with active learning. [Methods] First, the BERT pre-trained language model produces word vector representations, and a head-tail pointer network extracts long terms. Then, a new active learning sampling strategy selects high-quality samples from the unlabeled data, and the model is trained iteratively on them to reduce its dependence on data scale. [Results] The F1 value for long term extraction improved by 0.50 percentage points. With active learning sampling, about 50% of the data, chosen as high-quality samples, reached the same F1 value as training on 100% of the training data. [Limitations] Because of limited computing power, the dataset in this paper is small, and the active learning sampling strategy added at the text processing stage incurs a high time cost on large-scale data. [Conclusions] Combining head-tail pointers with active learning extracts long terms effectively and reduces the cost of data annotation.

Keywords: Term Extraction, Active Learning, Head-to-Tail Pointer Network, BERT, Weaponry
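
As a rough illustration of the two ideas named in the abstract, the sketch below decodes term spans from per-token head and tail probabilities and then ranks unlabeled sentences by uncertainty. It is not the authors' implementation: the function names, the 0.5 threshold, the `max_len` cap, the nearest-tail pairing rule, and the least-confidence style score are all assumptions chosen for illustration, and the paper's own sampling strategy is not described on this page. In a real system the probabilities would come from classification layers on top of BERT token representations; here they are passed in as plain lists.

```python
from typing import List, Sequence, Tuple

Probs = Sequence[float]


def decode_spans(head_probs: Probs, tail_probs: Probs,
                 threshold: float = 0.5, max_len: int = 30) -> List[Tuple[int, int]]:
    """Pair each predicted head with the nearest tail at or after it.

    head_probs[i] / tail_probs[i] are hypothetical probabilities that token i
    starts / ends a term. Returns inclusive (start, end) token index pairs.
    """
    heads = [i for i, p in enumerate(head_probs) if p >= threshold]
    tails = [i for i, p in enumerate(tail_probs) if p >= threshold]
    spans = []
    for h in heads:
        for t in tails:  # tails are in ascending order
            if h <= t < h + max_len:
                spans.append((h, t))
                break  # keep only the nearest valid tail
    return spans


def sentence_uncertainty(head_probs: Probs, tail_probs: Probs) -> float:
    """Mean distance of each probability from a confident 0 or 1 (assumed score)."""
    probs = list(head_probs) + list(tail_probs)
    if not probs:
        return 0.0
    return sum(min(p, 1.0 - p) for p in probs) / len(probs)


def select_for_annotation(pool: Sequence[Tuple[Probs, Probs]], k: int) -> List[int]:
    """Return indices of the k most uncertain unlabeled sentences in the pool."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: sentence_uncertainty(*pool[i]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    head = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
    tail = [0.1, 0.1, 0.1, 0.95, 0.1, 0.7]
    print(decode_spans(head, tail))

    pool = [
        ([0.99, 0.01], [0.01, 0.99]),  # confident predictions
        ([0.5, 0.6], [0.4, 0.5]),      # very uncertain predictions
        ([0.9, 0.2], [0.1, 0.8]),      # moderately uncertain predictions
    ]
    print(select_for_annotation(pool, k=2))
```

Because a span is defined only by its two endpoints, its length does not change the number of decisions the decoder makes, which is one plausible reason pointer-style decoding suits long terms. In an active learning loop, the sentences returned by `select_for_annotation` would be labeled, added to the training set, and the model retrained before the next round.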

Received: 2022-11-21 Published: 2023-05-16

Chinese Library Classification (ZTFLH): TP391; G250

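Funding: *National Natural Science Foundation of China (62171043); National Defense Science and Technology Key Laboratory Fund (6412006200404); Beijing Natural Science Foundation (4212020)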

Corresponding author: You Xindong, ORCID: 0000-0002-3351-4599, E-mail: youxindong@bistu.edu.cn.

Cite this article:

Lyu Xueqiang, Yang Yuting, Xiao Gang, Li Yuxian, You Xindong. Extracting Long Terms from Sparse Samples[J]. Data Analysis and Knowledge Discovery, 2024, 8(1): 135-145.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1231 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I1/135

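Fig.1 Term recognition model based on head-tail pointers

Fig.2 Active learning workflow

Fig.3 Distribution of data lengths

Fig.4 Distribution of term lengths

Table 1 Training environment configuration

Table 2 Comparison of extraction models

Table 3 Sample comparison

Table 4 Comparison of active learning algorithms

Table 5 Comparative experiment results

Fig.5 Comparative experiment results

Table 6 Effect of data scale on model performance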


