Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 128-141    DOI: 10.11925/infotech.2096-3467.2022.0261
Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples
Liu Xingli1,2, Fan Junjie2, Ma Haiqun1
1Research Center of Information Resources Management, Heilongjiang University, Harbin 150080, China
2School of Computer and Information Engineering, Heilongjiang University of Science and Technology, Harbin 150020, China
[Objective] This paper proposes a strategy for improving data augmentation algorithms for named entity recognition (NER) with small samples. [Methods] Taking domain NER as an example, we propose a multi-dimensional improvement strategy based on the easy data augmentation (EDA) algorithm. It comprises four methods: entity replacement with a mixture of multiple domain dictionaries, part-of-speech replacement with a domain semantic classification dictionary, random deletion with a semantic protection mechanism, and random insertion with part-of-speech protection. An NER model is trained with each of the four methods and with combinations of them. [Results] In the small-sample domain NER experiments, each single improved EDA strategy raised performance: the F value increased by 3.2, 4.6, 4.5 and 2.5 percentage points, respectively. In contrast, the F value was poor when two or more strategies were combined. In extended experiments on small-sample People's Daily and Weibo datasets, the improvement was significant: the entity replacement strategy based on multi-domain dictionary mixing raised the F value on the two datasets by up to 6.7 percentage points. [Limitations] In the multi-strategy combination experiments, tuning the parameters α and N becomes more difficult, which affects the NER improvement of the combined strategies. [Conclusions] The improvements to the EDA algorithm proposed in this paper effectively improve the results of NER models with small samples.


Key words: Data Augment; EDA; Small Sample; Named Entity Recognition
Received: 27 March 2022      Published: 09 November 2022
ZTFLH:  TP393 G250  
Fund: National Social Science Fund of China (21&ZD336); National Social Science Fund of China (20ATQ004)
Liu Xingli, Fan Junjie, Ma Haiqun. Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples. Data Analysis and Knowledge Discovery, 2022, 6(10): 128-141.

The Framework of Small Sample NER Based on EDA Improved Algorithm
The Improved MDDM Entity Replacement Strategy Design
The Example of Improved Strategy
The Dependency Tree Example of the Corpus
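Strategy | Advantages | Disadvantages
Entity replacement based on a mixture of multiple domain dictionaries | Without changing sentence structure, greatly expands the target entities with multi-source domain data, enriching the quality and quantity of samples | Because sentence patterns are not adjusted, sentence structure does not change and generalization ability is limited
Part-of-speech replacement based on a domain semantic classification dictionary | Building a domain part-of-speech dictionary preserves the semantic dependencies of the replacement word in its current context and enables the replacement of non-target words |
Random deletion based on a semantic protection mechanism | Uses dependency parsing to give the targets semantic protection, avoiding the semantic loss caused by conventional random deletion |
Random insertion based on part-of-speech protection | Introduces reasonable noise while protecting the contextual semantic features that matter for NER |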
The Advantage and Disadvantage of the Improved Algorithm Comparison
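The first strategy in the table above, dictionary-based entity replacement, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the BIO tag scheme, character-level tokens, and the reading of α as a per-entity replacement probability and N as the number of augmented copies per sentence are all assumptions made for illustration.

```python
import random


def replace_entities(tokens, tags, dictionaries, alpha=0.4, n_aug=6, seed=0):
    """Return n_aug augmented (tokens, tags) pairs.

    tokens, tags: parallel lists using BIO tags such as "B-CIT", "I-CIT", "O".
    dictionaries: hypothetical mapping from entity type to entity strings,
                  merged from several domain dictionaries.
    alpha: assumed to be the probability of replacing each entity span.
    n_aug: assumed to be the number of augmented copies (N).
    """
    rng = random.Random(seed)

    # Collect entity spans as (start, end, entity_type).
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + etype:
                j += 1
            spans.append((i, j, etype))
            i = j
        else:
            i += 1

    augmented = []
    for _ in range(n_aug):
        new_tokens, new_tags, prev = [], [], 0
        for start, end, etype in spans:
            # Copy the untouched context before this entity.
            new_tokens += tokens[prev:start]
            new_tags += tags[prev:start]
            candidates = [c for c in dictionaries.get(etype, []) if c]
            if candidates and rng.random() < alpha:
                # Swap in a same-type entity and re-tag it character by character.
                chars = list(rng.choice(candidates))
                new_tokens += chars
                new_tags += ["B-" + etype] + ["I-" + etype] * (len(chars) - 1)
            else:
                new_tokens += tokens[start:end]
                new_tags += tags[start:end]
            prev = end
        new_tokens += tokens[prev:]
        new_tags += tags[prev:]
        augmented.append((new_tokens, new_tags))
    return augmented
```

Because only entity spans are rewritten, the surrounding sentence structure is unchanged, which matches both the advantage and the limitation the table lists for this strategy.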
The Framework of RoBERTa-WWM-BiLSTM-CRF
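Entity type | Type name | Count | Example entity
NAT | Country | 2 665 | England (英格兰)
PLA | Aircraft | 231 | F-22 fighter (F-22战斗机)
VES | Ship | 158 | Northumberland (诺森伯兰号)
WAT | Sea area | 125 | Baltic Sea (波罗的海)
STA | State | 95 | Oregon (俄勒冈州)
CIT | City | 121 | Moscow (莫斯科)
MIS | Missile | 70 | Agni missile (烈火导弹)
RAD | Radar | 76 | Arctic over-the-horizon radar (北极超视距雷达)
ARM | Military unit | 99 | US Navy Seventh Fleet (美海军第七舰队)
REG | Region | 96 | Nagorno-Karabakh region (纳卡地区)
BAS | Base | 86 | Hindon base (欣登基地)
ISL | Island | 82 | Xisha Islands (西沙群岛)
AIR | Aircraft carrier | 80 | Queen Elizabeth (伊丽莎白女王号)
AIP | Airport | 60 | Kabul Airport (喀布尔机场)
POR | Port | 50 | Naha Port (那霸港)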
The Sample and Annotation Criterion of Domain Entities
Item | Configuration
Operating system | Ubuntu
GPU | NVIDIA GeForce RTX 2080 Ti
Python | 3.6.0
TensorFlow | 2.2.0
Memory | 62 GB
GPU memory | 11 GB
Hard disk | 200 GB
The Experimental Configuration
Parameter | Value
Batch_size | 128
Seq_max_len | 256
Dropout | 0.4
learning rate | 8e-3
LSTM unit | 128
epoch | 5
optimizer | RAdam
Random embedding size | 300
Word2vec embedding size | 300
Word2vec window | 5
Word2vec iter | 5
The Experimental Environment
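For reproduction, the hyperparameters listed above can be collected into a single configuration object. The sketch below is only one way to do so; the class name `TrainConfig` and the field names are hypothetical, while the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the table; field names are illustrative only."""
    batch_size: int = 128
    seq_max_len: int = 256
    dropout: float = 0.4
    learning_rate: float = 8e-3
    lstm_units: int = 128
    epochs: int = 5
    optimizer: str = "RAdam"
    random_embedding_size: int = 300
    word2vec_embedding_size: int = 300
    word2vec_window: int = 5
    word2vec_iter: int = 5


config = TrainConfig()
```

Freezing the dataclass keeps one experiment's settings from being mutated by accident when several augmentation strategies are compared under the same training setup.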
The Experiment of MDDM Entity Replacement Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(SR), α = 0.4, N = 6 | 78.3 | 80.6 | 79.3
EDA_self(MDDM), α = 0.4, N = 6 | 78.8 | 86.2 | 82.2
EDA_external(MDDM), α = 0.4, N = 6 | 80.6 | 84.4 | 82.4
The Experimental Evaluation of MDDM Entity Replacement Strategy
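The evaluation tables report P_w-avg, R_w-avg and F_w-avg. The source does not define these here; the sketch below assumes they are support-weighted averages of the per-entity-type precision, recall and F1, which is a common convention. The function name and the input format are hypothetical.

```python
def weighted_average_prf(per_type):
    """Support-weighted average precision, recall and F1.

    per_type: {entity_type: (tp, fp, fn)} with entity-level counts.
    The weight of a type is its support (tp + fn), i.e. its gold entity count.
    """
    total_support = sum(tp + fn for tp, _, fn in per_type.values())
    if total_support == 0:
        return 0.0, 0.0, 0.0

    p_avg = r_avg = f_avg = 0.0
    for tp, fp, fn in per_type.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = (tp + fn) / total_support
        p_avg += weight * precision
        r_avg += weight * recall
        f_avg += weight * f1
    return p_avg, r_avg, f_avg
```

Under this weighting, frequent entity types dominate the score, so a gain on a rare type moves the averages only slightly.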
The Experiment of DSCD Part-of-Speech Replacement Strategy Under Different Parameters
EDA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RS), α = 0.4, N = 10 | 78.89 | 79.75 | 79.04
EDA_(DSCD), α = 0.4, N = 10 | 81.8 | 85.7 | 83.6
The Experimental Evaluation of DSCD Part-Speech Replacement Strategy
The Experiment of SPM Random Deletion Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RD), α = 0.6, N = 10 | 74.4 | 81.4 | 77.5
EDA_(SPM), α = 0.6, N = 10 | 80.7 | 84.4 | 82.0
The Experimental Evaluation of SPM Random Deletion Strategy
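The contrast between plain random deletion (RD) and deletion with semantic protection (SPM) can be sketched as follows. This is an illustration under stated assumptions, not the authors' mechanism: the protection rule (keep entity tokens and their direct dependency heads), the `heads` input and the function name are hypothetical, and α is assumed to be the per-token deletion probability.

```python
import random


def protected_random_deletion(tokens, tags, heads, alpha=0.3, seed=0):
    """Randomly delete unprotected tokens and return (tokens, tags).

    tags:  BIO tags parallel to tokens; any tag other than "O" marks an entity.
    heads: heads[i] is the index of token i's dependency head, -1 for the root
           (assumed to come from any dependency parser).
    alpha: assumed probability of deleting each unprotected token.
    """
    rng = random.Random(seed)

    # Protect entity tokens and the tokens they directly depend on.
    protected = set()
    for i, tag in enumerate(tags):
        if tag != "O":
            protected.add(i)
            if heads[i] >= 0:
                protected.add(heads[i])

    kept = [
        i for i in range(len(tokens))
        if i in protected or rng.random() >= alpha
    ]
    if not kept:
        # Nothing survived: fall back to the unmodified sentence.
        kept = list(range(len(tokens)))
    return [tokens[i] for i in kept], [tags[i] for i in kept]
```

Setting the protected set to empty reduces this to plain random deletion, which makes the two variants easy to compare under the same α.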
The Evaluation of RSI Strategy Under Different Parameters
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(RI), α = 0.4, N = 10 | 80.1 | 81.8 | 80.8
EDA_(RSI), α = 0.4, N = 10 | 80.9 | 86.1 | 83.3
The Experimental Evaluation of RSI Strategy
DA Techniques | P_w-avg (%) | R_w-avg (%) | F_w-avg (%)
EDA_(MDDM), α = 0.4 | 78.8 | 86.2 | 82.2
EDA_(RPS), α = 0.2 | 79.1 | 84.1 | 81.4
EDA_(SPM), α = 0.2 | 79.4 | 82.7 | 80.9
EDA_(RSI), α = 0.2 | 79.7 | 83.9 | 81.6
EDA_(MDDM_RSI), α_MDDM = 0.4, α_RSI = 0.6 | 79.0 | 86.6 | 82.4
EDA_(MDDM_SPM), α_SRD = 0.4, α_SPM = 0.3 | 77.8 | 84.3 | 80.7
EDA_(MDDM_DSCD), α_MDDM = 0.4, α_RPS = 0.2 | 79.9 | 85.1 | 82.3
EDA_(DSCD_SPM), α_DSCD = 0.2, α_SPM = 0.3 | 77.9 | 83.6 | 80.4
EDA_(RSI_RPS_SPM), α_RSI = 0.6, α_RPS = 0.2, α_SPM = 0.3 | 78.0 | 84.0 | 80.6
EDA_(MDDM_RPS_SPM_RSI), α_MDDM = 0.4, α_RSI = 0.6, α_RPS = 0.2, α_SPM = 0.3 | 80.2 | 83.2 | 81.5
The Experimental Evaluation of Multiple Strategy
PeopleDailyNER_S | 206 | 124 | 96
PeopleDailyNER_M | 324 | 183 | 170
PeopleDailyNER_L | 661 | 350 | 349
PeopleDailyNER_F | 16 571 | 9 722 | 8 144
The Small Sample Dataset of Name Entities Recognition from PeopleDaily
WeiBo_S | 99 | 103 | 4 | 13 | 6 | 24 | 40
WeiBo_M | 198 | 291 | 14 | 33 | 4 | 73 | 76
WeiBo_L | 313 | 358 | 20 | 49 | 8 | 105 | 132
WeiBo_F | 416 | 522 | 28 | 62 | 17 | 148 | 177
The Small Sample Dataset of Name Entities Recognition from WeiBo
Method | PeopleDaily S (%) | PeopleDaily M (%) | PeopleDaily L (%) | PeopleDaily F (%) | WeiBo S (%) | WeiBo M (%) | WeiBo L (%) | WeiBo F (%)
No augmentation | 69.7 | 72.0 | 78.5 | 90.9 | 32.3 | 37.4 | 41.8 | 43.6
SR | 70.0 | 71.4 | 78.7 | 88.7 | 29.0 | 40.1 | 43.6 | 43.8
MDDM, α = 0.4 | 72.4 | 73.4 | 78.9 | 90.6 | 33.45 | 44.1 | 43.9 | 43.2
RS | 69.9 | 67.6 | 76.3 | 87.8 | 33.2 | 37.7 | 43.2 | 41.8
RPS, α = 0.4 | 71.4 | 74.1 | 79.8 | 91.3 | 32.0 | 36.9 | 40.3 | 40.3
RD | 69.2 | 70.5 | 79.6 | 90.2 | 31.9 | 40.6 | 41.7 | 45.3
SPM, α = 0.3 | 70.1 | 72.3 | 74.3 | 90.5 | 35.3 | 40.6 | 44.2 | 45.5
RI | 64.5 | 66.6 | 73.2 | 90.9 | 35.4 | 42.1 | 39.7 | 46.1
RSI, α = 0.4 | 71.4 | 72.2 | 79.7 | 91.3 | 34.9 | 39.8 | 42.8 | 46.2
EDA_(MDDM_DSCD), α = 0.4, 0.2 | 73.2 | 72.5 | 79.5 | 90.6 | 37.7 | 42.5 | 38.7 | 42.7
EDA_(MDDM_RPS_SPM_RSI) | 70.3 | 72.2 | 79.2 | 91.2 | 32.8 | 37.7 | 43.2 | 43.9
The Experimental Evaluation of Different Scale Datasets
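The gain of up to 6.7 percentage points reported in the abstract for the MDDM strategy can be checked against the "No augmentation" and MDDM rows of the table above. The short script below is a worked calculation, assuming the table values are F values in percent; the variable and function names are illustrative.

```python
# Scores copied from the "No augmentation" and "MDDM, α = 0.4" rows.
baseline = {
    "PeopleDaily": {"S": 69.7, "M": 72.0, "L": 78.5, "F": 90.9},
    "WeiBo": {"S": 32.3, "M": 37.4, "L": 41.8, "F": 43.6},
}
mddm = {
    "PeopleDaily": {"S": 72.4, "M": 73.4, "L": 78.9, "F": 90.6},
    "WeiBo": {"S": 33.45, "M": 44.1, "L": 43.9, "F": 43.2},
}


def gains(method, reference):
    """Difference in percentage points for every dataset and subset."""
    return {
        dataset: {
            size: round(method[dataset][size] - reference[dataset][size], 2)
            for size in reference[dataset]
        }
        for dataset in reference
    }


mddm_gains = gains(mddm, baseline)
best_gain = max(g for sizes in mddm_gains.values() for g in sizes.values())
```

The largest difference is on the WeiBo M subset (44.1 versus 37.4), while both full datasets (F) show a small decrease, which is consistent with the augmentation helping mainly when training samples are scarce.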
