Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 128-141    DOI: 10.11925/infotech.2096-3467.2022.0261
Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples
Liu Xingli1,2, Fan Junjie2, Ma Haiqun1
1Research Center of Information Resources Management, Heilongjiang University, Harbin 150080, China
2School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150020, China
[Objective] This paper proposes a strategy to improve the data augmentation algorithm for named entity recognition with small samples. [Methods] Taking domain named entity recognition as an example task, we propose a multi-dimensional improvement strategy based on the easy data augmentation (EDA) algorithm: entity replacement using a mixture of multiple domain dictionaries, part-of-speech replacement based on domain semantic classification dictionaries, random deletion with a semantic protection mechanism, and random insertion with part-of-speech protection. Each of the four strategies, as well as their improved combinations, is used to train a named entity recognition (NER) model. [Results] In the small-sample domain NER experiments, each single improved EDA strategy raised the F value, by 3.2, 4.6, 4.5 and 2.5 percentage points respectively, whereas combinations of two or more strategies performed poorly on the F value. In extension experiments on small samples of the People’s Daily and Weibo datasets, the improvement was significant: the entity replacement strategy based on mixed multi-domain dictionaries raised the F value on the two datasets by up to 6.7 percentage points. [Limitations] In the multi-strategy combination experiments, tuning the parameters α and N becomes more difficult, which affects the NER improvement achieved by the combined strategies. [Conclusions] The improvement strategy for the EDA algorithm proposed in this paper effectively improves the results of named entity recognition models with small samples.
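As a rough illustration of the first strategy, dictionary-based entity replacement, the following Python sketch swaps BIO-tagged entities for same-type entries drawn from a dictionary and regenerates the tags. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `find_spans` and `replace_entities`, the dictionary layout, and the reading of α as a per-entity replacement probability are all hypothetical choices made here for illustration.

```python
import random


def find_spans(tags):
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and start is not None and tag[2:] == etype
        if not inside:
            if start is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans


def replace_entities(tokens, tags, dictionary, alpha=0.3, rng=None):
    """Replace each entity with a same-type dictionary entry with probability alpha.

    `dictionary` maps an entity type to a list of entities, each given as a
    non-empty list of tokens. Treating alpha as a replacement probability is
    an assumption of this sketch.
    """
    rng = rng or random.Random()
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in find_spans(tags):
        # Copy the non-entity text that precedes this span unchanged.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        candidates = dictionary.get(etype, [])
        if candidates and rng.random() < alpha:
            entity = rng.choice(candidates)
        else:
            entity = tokens[start:end]
        # Regenerate BIO tags, since the new entity may differ in length.
        new_tokens += entity
        new_tags += ["B-" + etype] + ["I-" + etype] * (len(entity) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical usage: merge two toy "domain" dictionaries, then augment.
mixed = {"LOC": [["Harbin"], ["Hei", "long", "jiang"]]}
sentence = ["I", "visited", "New", "York", "today"]
labels = ["O", "O", "B-LOC", "I-LOC", "O"]
augmented = replace_entities(sentence, labels, mixed, alpha=0.5, rng=random.Random(0))
```

Regenerating the tags from the replacement's length keeps the token and label sequences aligned, which matters because dictionary entries rarely have the same length as the entity they replace.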

Received: 27 March 2022      Published: 09 November 2022
Fund: National Social Science Fund of China (21&ZD336); National Social Science Fund of China (20ATQ004)
Corresponding Author: Liu Xingli, ORCID: 0000-0001-6126-9837      E-mail: liuxingli@usth.edu.cn