Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples
Liu Xingli1,2, Fan Junjie2, Ma Haiqun1
1 Research Center of Information Resources Management, Heilongjiang University, Harbin 150080, China
2 School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150020, China
Abstract [Objective] This paper proposes strategies to improve the data augmentation algorithm for named entity recognition (NER) with small samples. [Methods] Taking domain NER as an example, we propose a multi-dimensional improvement of the easy data augmentation (EDA) algorithm: entity replacement that mixes multiple domain dictionaries, part-of-speech replacement based on domain semantic classification dictionaries, random deletion with a semantic protection mechanism, random insertion with part-of-speech protection, and a combination of these four strategies. An NER model is trained with each of the four strategies and with their combination. [Results] In the domain NER experiments with small samples, each single improved EDA strategy raised the F value, by 3.2, 4.6, 4.5 and 2.5 percentage points respectively. In contrast, the F value was poor when two or more strategies were combined. In the extension experiments on the People's Daily and Weibo datasets with small samples, the improvement was significant: the entity replacement strategy based on multi-domain dictionary mixing raised the F value on the two datasets by up to 6.7 percentage points. [Limitations] In the multi-strategy combination experiments, the parameters α and N become harder to tune, which limits the NER improvement of the combined strategy. [Conclusions] The proposed improvements to the EDA algorithm effectively improve the results of NER models trained on small samples.
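The following Python sketch illustrates two of the four single strategies named in the abstract, under assumptions that the abstract itself does not confirm. It assumes BIO-tagged token sequences, dictionaries of the hypothetical shape `{entity_type: [list of token lists]}`, and a semantic protection mechanism reduced to "never delete entity tokens or tokens in a `protected` set". The abstract does not define α and N; here `alpha` is taken to be a per-token deletion probability and `n` the number of augmented copies per sentence, following the usual EDA convention. All function names and sample data are illustrative, not the authors' implementation.

```python
import random


def replace_entities(tokens, tags, dictionaries, rng):
    """Swap each entity span for a same-type entry drawn from the mixed dictionaries."""
    # Mix several (hypothetical) domain dictionaries into one pool per entity type.
    merged = {}
    for d in dictionaries:
        for etype, entries in d.items():
            merged.setdefault(etype, []).extend(entries)

    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1  # j ends one past the last token of the entity span
            candidates = merged.get(etype)
            # Keep the original span when no dictionary covers this type.
            new = rng.choice(candidates) if candidates else tokens[i:j]
            out_tokens.extend(new)
            out_tags.extend(["B-" + etype] + ["I-" + etype] * (len(new) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tags[i])
            i += 1
    return out_tokens, out_tags


def protected_random_deletion(tokens, tags, alpha, rng, protected=frozenset()):
    """Delete each unprotected non-entity token with probability alpha."""
    kept = [
        (t, g)
        for t, g in zip(tokens, tags)
        if g != "O" or t in protected or rng.random() >= alpha
    ]
    if not kept:  # never return an empty sentence
        return list(tokens), list(tags)
    return [t for t, _ in kept], [g for _, g in kept]


def augment(tokens, tags, strategy, n):
    """Apply one strategy n times to produce n augmented (tokens, tags) pairs."""
    return [strategy(tokens, tags) for _ in range(n)]


# Illustrative data only; the entity types and entries are made up.
rng = random.Random(0)
tokens = ["the", "F-16", "fighter", "landed", "at", "Kadena", "Air", "Base"]
tags = ["O", "B-WEAPON", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
dicts = [
    {"WEAPON": [["J-20"], ["F-22", "Raptor"]]},
    {"LOC": [["Ramstein", "Air", "Base"]]},
]
samples = augment(
    tokens, tags, lambda t, g: replace_entities(t, g, dicts, rng), n=3
)
```

Each strategy is applied on its own here, which matches the reported finding that single strategies outperformed combinations; chaining strategies would mean tuning `alpha` and `n` jointly.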
Received: 27 March 2022
Published: 09 November 2022
Fund: National Social Science Fund of China (21&ZD336); National Social Science Fund of China (20ATQ004)
Corresponding Authors:
Liu Xingli, ORCID: 0000-0001-6126-9837
E-mail: liuxingli@usth.edu.cn