Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 128-141    DOI: 10.11925/infotech.2096-3467.2022.0261
Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples
Liu Xingli1,2, Fan Junjie2, Ma Haiqun1
1Research Center of Information Resources Management, Heilongjiang University, Harbin 150080, China
2School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150020, China
Abstract  

[Objective] This paper proposes a strategy for improving data augmentation algorithms for named entity recognition (NER) with small samples. [Methods] Taking domain NER as an example, we propose a multi-dimensional improvement strategy based on the easy data augmentation (EDA) algorithm: entity replacement with a mixture of multiple domain dictionaries, part-of-speech replacement with a domain semantic classification dictionary, random deletion with a semantic protection mechanism, random insertion with part-of-speech protection, and combinations of these four methods. An NER model is trained with each of the four single strategies and with their combinations. [Results] In the small-sample domain NER experiments, each single improved EDA strategy raised performance: the F value increased by 3.2, 4.6, 4.5 and 2.5 percentage points, respectively. In contrast, the F value was poor when two or more strategies were combined. In the extension experiments on the small-sample People's Daily and Weibo datasets, the improvement was significant: the entity replacement strategy based on multi-domain dictionary mixing raised the F value on the two datasets by up to 6.7 percentage points. [Limitations] In the multi-strategy combination experiments, tuning the parameters α and N becomes more difficult, which affects the NER improvement of the combined strategies. [Conclusions] The proposed improvements to the EDA algorithm effectively improve the results of NER models with small samples.
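The dictionary-based entity replacement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level tokens with BIO labels, treats α as a per-entity replacement probability (the page does not define α precisely), and uses a hypothetical one-entry dictionary.

```python
import random


def replace_entities(tokens, labels, entity_dict, alpha=0.4, rng=None):
    """Swap labelled entity spans for same-type dictionary entries.

    Assumptions (not taken from the paper): BIO labels, character-level
    tokens, and alpha as the probability of replacing each entity.
    """
    rng = rng or random.Random()
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label.startswith("B-"):
            etype = label[2:]
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            candidates = entity_dict.get(etype, [])
            if candidates and rng.random() < alpha:
                new_entity = list(rng.choice(candidates))  # split into characters
            else:
                new_entity = tokens[i:j]
            out_tokens.extend(new_entity)
            # Re-label the (possibly longer or shorter) span consistently.
            out_labels.extend(["B-" + etype] + ["I-" + etype] * (len(new_entity) - 1))
            i = j
        else:
            # Non-entity tokens are copied unchanged.
            out_tokens.append(tokens[i])
            out_labels.append(label)
            i += 1
    return out_tokens, out_labels


# Hypothetical port dictionary; POR is the port type from the entity table.
demo_tokens = list("抵达那霸港")
demo_labels = ["O", "O", "B-POR", "I-POR", "I-POR"]
demo_dict = {"POR": ["横须贺港"]}
new_tokens, new_labels = replace_entities(demo_tokens, demo_labels, demo_dict, alpha=1.0)
```

Because the sentence frame is copied unchanged, only the entity inventory grows, which matches the trade-off the page reports for this strategy.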


Key words: Data Augment; EDA; Small Sample; Named Entity Recognition
Received: 27 March 2022      Published: 09 November 2022
ZTFLH:  TP393 G250  
Fund: National Social Science Fund of China (21&ZD336); National Social Science Fund of China (20ATQ004)
Corresponding Author: Liu Xingli, ORCID: 0000-0001-6126-9837, E-mail: liuxingli@usth.edu.cn

Cite this article:

Liu Xingli, Fan Junjie, Ma Haiqun. Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples. Data Analysis and Knowledge Discovery, 2022, 6(10): 128-141.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0261     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I10/128

The Framework of Small Sample NER Based on EDA Improved Algorithm
The Improved MDDM Entity Replacement Strategy Design
The Example of Improved Strategy
The Dependency Tree Example of the Corpus
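Strategy | Advantages | Disadvantages
Entity replacement based on multi-domain dictionary mixing (MDDM) | Greatly expands the target entities with multi-source domain data without changing sentence structure, enriching both the quality and the quantity of samples | Sentence patterns are not adjusted, so sentence structure does not vary and generalization is limited
Part-of-speech replacement based on a domain semantic classification dictionary (DSCD) | The domain part-of-speech dictionary keeps replacement words semantically consistent with their context when non-target words are replaced; similarity computation makes the replacement words more precise and closer in meaning to the original words | Adds no corpus for the target entities; with small samples the domain part-of-speech dictionary is not rich enough, so candidate words are not precise enough
Random deletion based on a semantic protection mechanism (SPM) | Dependency parsing protects the semantics of the target entities, avoiding the semantic loss caused by traditional random deletion; the changed sentence patterns improve model generalization and robustness | Adds no target entities, so the corpus is not enriched; semantic protection may be excessive, limiting sentence-pattern variation
Random insertion based on part-of-speech protection (RSI) | Introduces reasonable noise while protecting the contextual semantic features that matter for NER; the adjusted sentence structure improves model generalization and robustness | Adds no target entities, so the corpus is not enriched; the noise may hurt model performance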
The Advantage and Disadvantage of the Improved Algorithm Comparison
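The idea of random deletion under a protection mechanism can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the set of protected token positions is taken as given (in the paper it comes from dependency parsing, which is not reproduced here), and α is treated as a per-token deletion probability.

```python
import random


def protected_random_delete(tokens, labels, protected, alpha=0.3, rng=None):
    """Delete unprotected non-entity tokens, each with probability alpha.

    `protected` is a set of token indices that must survive; here it is a
    hypothetical input standing in for the output of a dependency parser.
    Entity tokens (any label other than "O") are never deleted.
    """
    rng = rng or random.Random()
    kept = [
        (tok, lab)
        for idx, (tok, lab) in enumerate(zip(tokens, labels))
        if lab != "O" or idx in protected or rng.random() >= alpha
    ]
    if not kept:
        # Everything was deletable and was deleted: fall back to the input.
        return list(tokens), list(labels)
    new_tokens, new_labels = zip(*kept)
    return list(new_tokens), list(new_labels)


sentence = ["the", "fleet", "quickly", "reached", "Naha", "Port"]
tags = ["O", "O", "O", "O", "B-POR", "I-POR"]
# Index 3 ("reached") is assumed to be the word governing the entity.
short_tokens, short_tags = protected_random_delete(sentence, tags, {3}, alpha=1.0)
```

With α = 1.0 every unprotected context word is dropped, which makes the role of the protected set easy to see; smaller values of α give milder sentence-pattern changes.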
The Framework of RoBERTa-WWM-BiLSTM-CRF
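Entity type | Type name | Count | Example entity
NAT | Country | 2,665 | 英格兰 (England)
PLA | Aircraft | 231 | F-22战斗机 (F-22 fighter)
VES | Vessel | 158 | 诺森伯兰号 (Northumberland)
WAT | Sea area | 125 | 波罗的海 (Baltic Sea)
STA |  | 95 | 俄勒冈州 (Oregon)
CIT | City | 121 | 莫斯科 (Moscow)
MIS | Missile | 70 | 烈火导弹 (Agni missile)
RAD | Radar | 76 | 北极超视距雷达 (Arctic over-the-horizon radar)
ARM | Military unit | 99 | 美海军第七舰队 (U.S. Navy Seventh Fleet)
REG | Region | 96 | 纳卡地区 (Nagorno-Karabakh region)
BAS | Base | 86 | 欣登基地 (Hindon base)
ISL | Island | 82 | 西沙群岛 (Xisha Islands)
AIR | Aircraft carrier | 80 | 伊丽莎白女王号 (Queen Elizabeth)
AIP | Airport | 60 | 喀布尔机场 (Kabul Airport)
POR | Port | 50 | 那霸港 (Naha Port)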
The Sample and Annotation Criterion of Domain Entities
Item | Configuration
Operating system | Ubuntu
GPU | NVIDIA GeForce RTX 2080 Ti
Python | 3.6.0
TensorFlow | 2.2.0
RAM | 62 GB
GPU memory | 11 GB
Hard disk | 200 GB
The Experimental Configuration
Parameter | Value
Batch_size | 128
Seq_max_len | 256
Dropout | 0.4
learning rate | 8e-3
LSTM unit | 128
epoch | 5
optimizer | RAdam
Random embedding size | 300
Word2vec embedding size | 300
Word2vec window | 5
Word2vec iter | 5
The Experimental Environment
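For readers who want to reuse the hyperparameters listed above, one option is to collect them in a small configuration object. The field names and the `optimizer_steps` helper below are illustrative choices, not identifiers from the authors' code; only the values come from the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter table; field names are assumed.
    batch_size: int = 128
    seq_max_len: int = 256
    dropout: float = 0.4
    learning_rate: float = 8e-3
    lstm_units: int = 128
    epochs: int = 5
    optimizer: str = "RAdam"
    random_embedding_size: int = 300
    word2vec_embedding_size: int = 300
    word2vec_window: int = 5
    word2vec_iter: int = 5


def optimizer_steps(n_samples: int, cfg: TrainConfig) -> int:
    """Total optimizer updates for n_samples training sentences (hypothetical helper)."""
    return math.ceil(n_samples / cfg.batch_size) * cfg.epochs


cfg = TrainConfig()
steps = optimizer_steps(1000, cfg)  # 8 batches per epoch over 5 epochs
```

With a batch size of 128 and only 5 epochs, a small-sample corpus yields very few optimizer updates, which is one reason enlarging the training set through augmentation can matter in this setting.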
The Experiment of MDDM Entity Replacement Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(SR), α = 0.4, N = 6 | 78.3 | 80.6 | 79.3
EDA_self(MDDM), α = 0.4, N = 6 | 78.8 | 86.2 | 82.2
EDA_external(MDDM), α = 0.4, N = 6 | 80.6 | 84.4 | 82.4
The Experimental Evaluation of MDDM Entity Replacement Strategy
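The evaluation tables report P, R and F with a "w-avg" subscript. Assuming this denotes the usual support-weighted average over entity types (the page does not state the formula), the aggregation can be sketched as below. The per-type scores and supports in the example are made up for illustration only.

```python
def weighted_average(scores, supports):
    """Average per-type scores, weighting each type by its number of entities.

    Assumes "w-avg" means a support-weighted mean, as in common NER
    evaluation reports; this is an interpretation, not the paper's definition.
    """
    total = sum(supports.values())
    if total == 0:
        raise ValueError("supports must contain at least one entity")
    return sum(scores[etype] * supports[etype] for etype in supports) / total


# Hypothetical per-type F values (%) and test-set entity counts.
demo_scores = {"NAT": 80.0, "PLA": 60.0}
demo_supports = {"NAT": 3, "PLA": 1}
f_w_avg = weighted_average(demo_scores, demo_supports)
```

Under this reading, frequent types such as NAT dominate the reported figure, so gains on rare entity types move the weighted average only slightly.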
The Experiment of DSCD Part-of-Speech Replacement Strategy Under Different Parameters
EDA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RS), α = 0.4, N = 10 | 78.89 | 79.75 | 79.04
EDA_(DSCD), α = 0.4, N = 10 | 81.8 | 85.7 | 83.6
The Experimental Evaluation of DSCD Part-of-Speech Replacement Strategy
The Experiment of SPM Random Deletion Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RD), α = 0.6, N = 10 | 74.4 | 81.4 | 77.5
EDA_(SPM), α = 0.6, N = 10 | 80.7 | 84.4 | 82.0
The Experimental Evaluation of SPM Random Deletion Strategy
The Evaluation of RSI Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RI), α = 0.4, N = 10 | 80.1 | 81.8 | 80.8
EDA_(RSI), α = 0.4, N = 10 | 80.9 | 86.1 | 83.3
The Experimental Evaluation of RSI Strategy
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(MDDM), α = 0.4 | 78.8 | 86.2 | 82.2
EDA_(RPS), α = 0.2 | 79.1 | 84.1 | 81.4
EDA_(SPM), α = 0.2 | 79.4 | 82.7 | 80.9
EDA_(RSI), α = 0.2 | 79.7 | 83.9 | 81.6
EDA_(MDDM_RSI), α_MDDM = 0.4, α_RSI = 0.6 | 79.0 | 86.6 | 82.4
EDA_(MDDM_SPM), α_SRD = 0.4, α_SPM = 0.3 | 77.8 | 84.3 | 80.7
EDA_(MDDM_DSCD), α_MDDM = 0.4, α_RPS = 0.2 | 79.9 | 85.1 | 82.3
EDA_(DSCD_SPM), α_DSCD = 0.2, α_SPM = 0.3 | 77.9 | 83.6 | 80.4
EDA_(RSI_RPS_SPM), α_RSI = 0.6, α_RPS = 0.2, α_SPM = 0.3 | 78.0 | 84.0 | 80.6
EDA_(MDDM_RPS_SPM_RSI), α_MDDM = 0.4, α_RSI = 0.6, α_RPS = 0.2, α_SPM = 0.3 | 80.2 | 83.2 | 81.5
The Experimental Evaluation of Multiple Strategy
Dataset | LOC | ORG | PER
PeopleDailyNER_S | 206 | 124 | 96
PeopleDailyNER_M | 324 | 183 | 170
PeopleDailyNER_L | 661 | 350 | 349
PeopleDailyNER_F | 16,571 | 9,722 | 8,144
The Small Sample Dataset of Named Entity Recognition from PeopleDaily
Dataset | PER.NOM | PER.NAM | LOC.NOM | LOC.NAM | ORG.NOM | ORG.NAM | GPE.NAM
WeiBo_S | 99 | 103 | 4 | 13 | 6 | 24 | 40
WeiBo_M | 198 | 291 | 14 | 33 | 4 | 73 | 76
WeiBo_L | 313 | 358 | 20 | 49 | 8 | 105 | 132
WeiBo_F | 416 | 522 | 28 | 62 | 17 | 148 | 177
The Small Sample Dataset of Named Entity Recognition from WeiBo
Method | PeopleDaily S (%) | PeopleDaily M (%) | PeopleDaily L (%) | PeopleDaily F (%) | WeiBo S (%) | WeiBo M (%) | WeiBo L (%) | WeiBo F (%)
No augmentation | 69.7 | 72.0 | 78.5 | 90.9 | 32.3 | 37.4 | 41.8 | 43.6
SR | 70.0 | 71.4 | 78.7 | 88.7 | 29.0 | 40.1 | 43.6 | 43.8
MDDM, α = 0.4 | 72.4 | 73.4 | 78.9 | 90.6 | 33.45 | 44.1 | 43.9 | 43.2
RS | 69.9 | 67.6 | 76.3 | 87.8 | 33.2 | 37.7 | 43.2 | 41.8
RPS, α = 0.4 | 71.4 | 74.1 | 79.8 | 91.3 | 32.0 | 36.9 | 40.3 | 40.3
RD | 69.2 | 70.5 | 79.6 | 90.2 | 31.9 | 40.6 | 41.7 | 45.3
SPM, α = 0.3 | 70.1 | 72.3 | 74.3 | 90.5 | 35.3 | 40.6 | 44.2 | 45.5
RI | 64.5 | 66.6 | 73.2 | 90.9 | 35.4 | 42.1 | 39.7 | 46.1
RSI, α = 0.4 | 71.4 | 72.2 | 79.7 | 91.3 | 34.9 | 39.8 | 42.8 | 46.2
EDA_(MDDM_DSCD), α = 0.4, 0.2 | 73.2 | 72.5 | 79.5 | 90.6 | 37.7 | 42.5 | 38.7 | 42.7
EDA_(MDDM_RPS_SPM_RSI) | 70.3 | 72.2 | 79.2 | 91.2 | 32.8 | 37.7 | 43.2 | 43.9
The Experimental Evaluation of Different Scale Datasets
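The abstract's figure of a gain of up to 6.7 percentage points for the MDDM strategy can be checked against the table above with a few lines of arithmetic. The sketch assumes that the S, M, L and F columns are the small, medium, large and full subsets and that the cells are F values; the numbers are copied from the "No augmentation" and MDDM rows.

```python
# F values (%) copied from the table; keys S/M/L/F are the dataset subsets.
BASELINE = {
    "PeopleDaily": {"S": 69.7, "M": 72.0, "L": 78.5, "F": 90.9},
    "WeiBo": {"S": 32.3, "M": 37.4, "L": 41.8, "F": 43.6},
}
MDDM = {
    "PeopleDaily": {"S": 72.4, "M": 73.4, "L": 78.9, "F": 90.6},
    "WeiBo": {"S": 33.45, "M": 44.1, "L": 43.9, "F": 43.2},
}


def gains(method, baseline):
    """Percentage-point difference per (dataset, subset), rounded to 2 decimals."""
    return {
        (dataset, subset): round(method[dataset][subset] - baseline[dataset][subset], 2)
        for dataset in baseline
        for subset in baseline[dataset]
    }


mddm_gains = gains(MDDM, BASELINE)
best_setting = max(mddm_gains, key=mddm_gains.get)
best_gain = mddm_gains[best_setting]
```

The largest gain falls on the medium WeiBo subset, while both full datasets show a slight drop, which agrees with the abstract's claim that the benefit is concentrated in the small-sample setting.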