Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples
Liu Xingli1,2, Fan Junjie2, Ma Haiqun1
1Research Center of Information Resources Management, Heilongjiang University, Harbin 150080, China 2School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150020, China
[Objective] This paper proposes strategies to improve the data augmentation algorithm for named entity recognition (NER) with small samples. [Methods] Taking domain NER as an example, a multi-dimensional improvement strategy based on the easy data augmentation (EDA) algorithm is proposed: entity replacement using a mixture of multiple domain dictionaries, part-of-speech replacement based on domain semantic classification dictionaries, random deletion with a semantic protection mechanism, and random insertion with part-of-speech protection. An NER model is trained separately with each of the four strategies and with improved combinations of them. [Results] The domain NER experiments with small samples show that each single improved EDA strategy raised performance: the F value increased by 3.2, 4.6, 4.5 and 2.5 percentage points, respectively. In contrast, the F value was poorer when two or more strategies were combined. In the extension experiments on the People's Daily and Weibo datasets with small samples, the improvement was significant: the entity replacement strategy based on multi-domain dictionary mixing increased the F value on the two datasets by up to 6.7 percentage points. [Limitations] In the multi-strategy combination experiments, tuning the parameters α and N becomes more difficult, which affects the NER improvement of the combined strategies. [Conclusions] The improvement strategies for the EDA algorithm proposed in this paper effectively improve the results of NER models with small samples.
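As an illustration of two of the strategies named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes BIO-tagged token sequences, a hypothetical dictionary mapping each entity type to a list of tokenized replacement entities, and that α acts as a per-item change probability as in the original EDA. The abstract does not spell out the semantic protection mechanism; here it is approximated by never deleting entity tokens. The function names are invented for this example.

```python
import random


def replace_entities(tokens, tags, dictionary, alpha, rng):
    """Swap each BIO-tagged entity for a same-type dictionary entry with probability alpha.

    `dictionary` maps an entity type to a list of non-empty token lists (assumed format).
    """
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            candidates = dictionary.get(etype, [])
            if candidates and rng.random() < alpha:
                new = list(rng.choice(candidates))
            else:
                new = tokens[i:j]
            # Re-emit BIO tags so they match the length of the new entity.
            out_tokens.extend(new)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags


def protected_random_deletion(tokens, tags, alpha, rng):
    """Drop non-entity ("O") tokens with probability alpha; entity tokens are always kept."""
    kept = [(t, g) for t, g in zip(tokens, tags) if g != "O" or rng.random() >= alpha]
    if not kept:
        # Never return an empty sentence; fall back to the original.
        return list(tokens), list(tags)
    return [t for t, _ in kept], [g for _, g in kept]


def augment(tokens, tags, dictionary, alpha, n, seed=0):
    """Produce n augmented copies of one sentence using entity replacement only."""
    rng = random.Random(seed)
    return [replace_entities(tokens, tags, dictionary, alpha, rng) for _ in range(n)]
```

Under these assumptions, calling `augment(tokens, tags, dictionary, alpha=0.3, n=4)` would return four label-consistent variants of one sentence. Keeping the tags aligned with the replaced tokens is the point of the sketch: sentence-level EDA operations can otherwise corrupt token-level NER labels.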
Liu Xingli, Fan Junjie, Ma Haiqun. Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples[J]. Data Analysis and Knowledge Discovery, 2022, 6(10): 128-141.