Extracting Long Terms from Sparse Samples

Lyu Xueqiang1, Yang Yuting1, Xiao Gang2, Li Yuxian1, You Xindong1 (corresponding author)

1 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
2 General Key Laboratory of Complex System Simulation, Institute of Systems Engineering, Academy of Military Sciences, Beijing 100101, China

Abstract [Objective] This paper proposes a model that combines a head-tail pointer network with active learning, addressing the sparse-sample problem in identifying long terms in the weapons domain. [Methods] First, we used the BERT pre-trained language model to obtain word vector representations. Second, we extracted long terms with a head-tail pointer network. Third, we designed a new active-learning sampling strategy to select high-quality unlabeled samples. Finally, we trained the model iteratively to reduce its dependence on data scale. [Results] The F1 value for long-term extraction improved by 0.50%. With active-learning sampling, about 50% of the high-quality data achieved the same F1 value as training on 100% of it. [Limitations] Due to limited computing power, the data set in this paper was small, and the active-learning sampling strategy requires extra processing time. [Conclusions] The head-tail pointer and active-learning method can extract long terms effectively and reduce the cost of data annotation.
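The head-tail pointer extraction described above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the BERT encoder is omitted (the inputs stand in for per-token head/tail probabilities it would produce), and the 0.5 threshold, the `max_len` window, and the nearest-tail pairing rule are assumptions, though they are common decoding choices for pointer-style span extractors.

```python
import numpy as np

def decode_spans(head_probs, tail_probs, threshold=0.5, max_len=20):
    """Pair head and tail predictions into term spans.

    head_probs / tail_probs: per-token probabilities (1-D arrays) that a
    token starts / ends a term. For each head above the threshold, take
    the nearest tail at or after it (within max_len tokens) that is also
    above the threshold -- an assumed decoding rule for illustration.
    """
    heads = [i for i, p in enumerate(head_probs) if p >= threshold]
    tails = [i for i, p in enumerate(tail_probs) if p >= threshold]
    spans = []
    for h in heads:
        candidates = [t for t in tails if h <= t < h + max_len]
        if candidates:
            spans.append((h, candidates[0]))
    return spans

# A 5-token sentence: token 1 looks like a term head, token 3 like a tail,
# so tokens 1..3 are extracted as one long term.
head = np.array([0.1, 0.9, 0.2, 0.1, 0.1])
tail = np.array([0.1, 0.1, 0.1, 0.8, 0.1])
print(decode_spans(head, tail))  # -> [(1, 3)]
```

Because a long term is recovered as a single (head, tail) pair rather than a per-token label sequence, this decoding does not degrade as spans grow, which is the motivation for pointer networks over BIO tagging here.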
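The pool-based selection step of the active-learning loop can likewise be sketched. The scoring function below is a generic least-confidence heuristic, not the sampling strategy proposed in the paper; `select_for_labeling` is a hypothetical helper showing only the loop's shape: score unlabeled sentences, send the top-k to annotators, add them to the training set, retrain, repeat.

```python
import numpy as np

def least_confidence(head_probs, tail_probs):
    """Uncertainty score for a sentence: the closer the model's most
    decisive pointer probability is to 0.5, the higher the score and
    the more informative the sentence is assumed to be."""
    probs = np.concatenate([head_probs, tail_probs])
    return float(1.0 - 2.0 * np.max(np.abs(probs - 0.5)))

def select_for_labeling(scores, k):
    """Hypothetical helper: indices of the k most uncertain samples."""
    order = np.argsort(scores)[::-1]  # highest uncertainty first
    return order[:k].tolist()

scores = np.array([0.1, 0.7, 0.4, 0.9])
print(select_for_labeling(scores, 2))  # -> [3, 1]
```

Selecting only high-scoring samples each round is what lets roughly half of the labeled data match the F1 of training on all of it, at the cost of the extra scoring passes noted in the limitations.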
Received: 21 November 2022
Published: 16 May 2023
Fund: National Natural Science Foundation of China (62171043); Key Laboratories for National Defense Science and Technology (6412006200404); Beijing Municipal Natural Science Foundation (4212020)
Corresponding Author: You Xindong, ORCID: 0000-0002-3351-4599, E-mail: youxindong@bistu.edu.cn