Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (1): 135-145    DOI: 10.11925/infotech.2096-3467.2022.1231
Extracting Long Terms from Sparse Samples
Lyu Xueqiang1, Yang Yuting1, Xiao Gang2, Li Yuxian1, You Xindong1
1Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
2General Key Laboratory of Complex System Simulation, Institute of Systems Engineering, Academy of Military Sciences, Beijing 100101, China
Abstract  

[Objective] This paper proposes a model combining a head-tail pointer network with active learning, which addresses the sparse-sample problem and identifies long terms in the weaponry domain. [Methods] First, we used the BERT pre-trained language model to obtain word vector representations. Second, we extracted long terms with a head-tail pointer network. Third, we developed a new active learning sampling strategy to select high-quality unlabeled samples. Finally, we trained the model iteratively to reduce its dependence on data scale. [Results] The F1 value for long term extraction improved by 0.50 percentage points. With active learning sampling, about 50% of the high-quality data achieved the same F1 value as training on 100% of the data. [Limitations] Due to limited computing power, the data set in this paper was small, and the active learning sampling strategy requires extra processing time. [Conclusions] Combining a head-tail pointer network with active learning extracts long terms effectively and reduces the cost of data annotation.

Key words: Term Extraction; Active Learning; Head-to-Tail Pointer Network; BERT; Weaponry
Received: 21 November 2022      Published: 16 May 2023
CLC: TP391; G250
Fund:National Natural Science Foundation of China(62171043);Key Laboratories for National Defense Science and Technology(6412006200404);Beijing Municipal Natural Science Foundation(4212020)
Corresponding Author: You Xindong, ORCID: 0000-0002-3351-4599, E-mail: youxindong@bistu.edu.cn

Cite this article:

Lyu Xueqiang, Yang Yuting, Xiao Gang, Li Yuxian, You Xindong. Extracting Long Terms from Sparse Samples. Data Analysis and Knowledge Discovery, 2024, 8(1): 135-145.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1231     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I1/135

Figure: The Term Recognition Model Based on Head and Tail Pointers
Figure: Active Learning Process
Figure: Distribution of Data Length
Figure: Distribution of Term Length
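The active learning process shown above can be outlined as a simple loop: train on the labeled seed set, score the unlabeled pool, move the most informative samples into the labeled set after annotation, and repeat. The sketch below is an assumption-laden outline, not the paper's pipeline: the scoring function is a random stand-in rather than the Margin+Diff strategy, and the training step is omitted.

```python
# Hypothetical outline of the active learning loop: each round, the k
# highest-scoring unlabeled samples are "annotated" and moved into the
# labeled set. Random scores stand in for a real sampling strategy.
import random

def active_learning_loop(labeled, unlabeled, rounds=8, k=2, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        if not unlabeled:
            break
        # 1) (re)train the model on `labeled` -- omitted in this sketch
        # 2) score the unlabeled pool; here: random stand-in scores
        scored = sorted(unlabeled, key=lambda s: rng.random())
        # 3) annotate the k most informative samples and add them
        batch, unlabeled = scored[:k], scored[k:]
        labeled.extend(batch)
    return labeled, unlabeled

labeled, unlabeled = active_learning_loop(
    ["s0"], [f"s{i}" for i in range(1, 10)])
print(len(labeled), len(unlabeled))  # 10 0
```

The eight rounds mirror the eight sampling rounds reported in the experiments below; swapping the random scorer for an uncertainty measure turns this into a real strategy.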
Item      Configuration
OS        Linux Ubuntu 16.04
CPU       Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz (12-core)
GPU       8 × NVIDIA Tesla V100 (16GB)
Python    3.8.9
PyTorch   1.7.1
Memory    64GB
Training Environment
No.  Extraction Model  P (%)  R (%)  F1 (%)
1 CRF 83.31 74.23 78.51
2 Bi-LSTM 61.03 70.63 65.48
3 BERT+Softmax 90.37 87.81 89.07
4 BERT+CRF 91.26 88.49 89.85
5 BERT + head-tail pointer network 91.57 89.16 90.35
Comparison of Extraction Models
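The P/R/F1 columns in the comparison table above are internally consistent with the standard definition F1 = 2PR/(P+R), which can be checked directly:

```python
# Verify reported F1 values against the standard harmonic mean of
# precision and recall, F1 = 2PR / (P + R).
def f1(p, r):
    return 2 * p * r / (p + r)

# Row 1 (CRF) and row 5 (BERT + head-tail pointer network) of the table
print(round(f1(83.31, 74.23), 2))  # 78.51
print(round(f1(91.57, 89.16), 2))  # 90.35
```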
No.  Original Sentence  CRF Extraction Result  Head-Tail Pointer Extraction Result
1 盾牌级导弹艇-(Skjold)舰宽是13.5米。 盾牌级导弹艇 盾牌级导弹艇-(skjold)
2 智利就为其改装了以色列的“巴拉克”防空导弹垂直发射系统。 发射系统 “巴拉克”防空导弹垂直发射系统
3 054a型(北约称江凯ii级)导弹护卫舰,由沪东造船厂建造成。 054a型(北约称江凯ii级),护卫舰 054a型(北约称江凯ii级)导弹护卫舰
4 “四平”号(544)2座b515型324毫米反潜鱼雷三联发射器(携带a244/s型轻型反潜鱼雷24枚)。 “四平”号(544),b515型324毫米反潜鱼雷三联发射器(携带,a244/s型轻型反潜鱼雷24枚) “四平”号(544),b515型324毫米反潜鱼雷三联发射器,a244/s型轻型反潜鱼雷
Sample Comparison
No.  Sampling Algorithm    Description
1    Random                Randomly samples from the unlabeled data
2    LC                    Least-confidence sampling
3    MNLP                  Sampling that normalizes LC (by sequence length)
4    Margin                Character-level margin sampling
5    Margin+Diff(static)   Part of each batch drawn by character-level margin sampling and part from sequences whose predicted term-head and term-tail counts differ; the per-round sample size is fixed
6    Margin+Diff(dynamic)  As above, but the per-round sample size changes with the model's predictions
Comparative Experiment of Active Learning Algorithms
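The Margin+Diff idea from the table above can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: sequence uncertainty is taken as the average top-1/top-2 probability gap per character, the head/tail count mismatch serves as the Diff signal, and inconsistent sequences are simply prioritized ahead of low-margin ones.

```python
# Sketch of Margin+Diff sampling: score sequences by character-level
# margin (smaller gap = more uncertain) and flag sequences whose
# predicted head and tail counts disagree. Function names and the
# prioritization rule are illustrative assumptions.

def margin_score(token_probs):
    """Average top-1/top-2 probability gap over the sequence."""
    margins = []
    for probs in token_probs:
        top2 = sorted(probs, reverse=True)[:2]
        margins.append(top2[0] - top2[1])
    return sum(margins) / len(margins)

def head_tail_diff(pred_heads, pred_tails):
    """Nonzero when predicted head/tail counts disagree -> likely error."""
    return len(pred_heads) - len(pred_tails)

def select_batch(pool, k):
    """pool: list of (seq_id, token_probs, heads, tails) tuples.
    Pick k samples: head/tail-inconsistent sequences first,
    then the lowest-margin (most uncertain) ones."""
    inconsistent = [s for s in pool if head_tail_diff(s[2], s[3]) != 0]
    rest = [s for s in pool if head_tail_diff(s[2], s[3]) == 0]
    rest.sort(key=lambda s: margin_score(s[1]))
    return [s[0] for s in (inconsistent + rest)[:k]]

pool = [
    ("a", [[0.9, 0.1], [0.8, 0.2]], [0], [1]),    # confident, consistent
    ("b", [[0.55, 0.45], [0.6, 0.4]], [0], [1]),  # uncertain, consistent
    ("c", [[0.9, 0.1]], [0, 2], [1]),             # head/tail counts disagree
]
print(select_batch(pool, 2))  # ['c', 'b']
```

Making k vary across rounds with the model's predictions corresponds to the dynamic variant; keeping it fixed corresponds to the static variant.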
Model  F1 (%) at each sampling round
1 2 3 4 5 6 7 8
Random 82.93 85.02 85.68 86.15 86.53 85.78 84.64 85.71
LC 84.94 87.37 87.70 87.97 88.69 89.45 90.01 90.27
MNLP 84.46 87.41 87.46 89.32 87.85 87.94 88.71 90.26
Margin 84.88 88.25 87.25 87.93 88.31 89.43 89.33 90.05
Margin+Diff(static) 83.06 86.98 87.79 88.30 88.29 88.84 89.60 90.02
Margin+Diff(dynamic) 87.02(↑) 88.31(↑) 88.76(↑) 88.61(↓) 88.63(↑) 89.86(↑) 90.79(↑) 90.53(↓)
Experimental Results
Comparative Experimental Results
Model  F1 (%) after 8 sampling rounds
1% data  5% data  10% data  15% data  20% data
Random 41.67 69.89 78.43 79.82 80.56
LC 38.11 66.67 77.27 80.41 81.05
MNLP 35.40 68.09 76.83 79.39 80.05
Margin 31.76 68.82 74.30 80.28 81.20
Margin+Diff(static) 35.58 71.41 73.29 79.52 80.33
Margin+Diff(dynamic) 36.74(↑) 67.08(↑) 79.18(↑) 81.94(↑) 82.11(↑)
Impact of Sample Data Size on Model Performance