Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 78-88    DOI: 10.11925/infotech.2096-3467.2019.0034
Extracting Name Entities from Ecological Restoration Literature with Bi-LSTM+CRF
Ma Jianxia, Yuan Hui, Jiang Xiang
The Northwest Institute of Eco-Environment and Resources, Library and Information Center, Chinese Academy of Sciences, Lanzhou 730000, China
Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
[Objective] This study extracts named entities from ecological restoration literature, including governance technologies for fragile ecosystems, the sites where they were implemented, and the time of implementation. [Methods] We combined a Bi-LSTM+CRF model with a feature-based named entity knowledge base to automatically extract these entities from CNKI documents. [Results] For the extraction of entities on ecological governance technology, the P, R and F1 values were 74.34%, 64.04% and 68.81%, respectively. Compared with the classic CRF method, the model improved P and F1 by 9.41 and 4.26 percentage points, while R was essentially unchanged. [Limitations] The accuracy of Chinese word segmentation tools may affect the model's performance, and the relationships among entities need further study. [Conclusions] The proposed model could support fine-grained, content-based analysis of resource and environment information.

Key words: Bi-LSTM+CRF; Text Mining; Ecological Restoration Technology; Named Entity Recognition
Received: 08 January 2019      Published: 26 April 2020
ZTFLH:  TP391  
Corresponding Author: Ma Jianxia

Ma Jianxia, Yuan Hui, Jiang Xiang. Extracting Name Entities from Ecological Restoration Literature with Bi-LSTM+CRF. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 78-88.

Research Framework
The Structure of Bi-LSTM+CRF
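In a Bi-LSTM+CRF tagger, the CRF layer chooses the tag sequence as a whole instead of picking the best tag for each token independently. The following is a minimal sketch of that decoding step (Viterbi), not the authors' implementation. The function name `viterbi_decode` and all scores are hypothetical: in a trained model the emission scores would come from the Bi-LSTM and the transition scores from the learned CRF parameters.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one dict per token, tag -> score (stand-in for Bi-LSTM output).
    transitions: dict (prev_tag, tag) -> score (stand-in for CRF weights);
                 pairs that are not listed score 0.0.
    """
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, tag), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[tag] = (score + emissions[t][tag], prev)
        best.append(column)
    # Backtrack from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag][0])]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


tags = ["O", "B-Tech", "I-Tech"]
# Made-up scores: taken token by token, the second token would be tagged I-Tech.
emissions = [
    {"O": 1.0, "B-Tech": 0.2, "I-Tech": 0.0},
    {"O": 0.4, "B-Tech": 0.5, "I-Tech": 0.6},
    {"O": 0.1, "B-Tech": 0.0, "I-Tech": 0.9},
]
# Made-up penalty: an I-Tech tag should not follow O in IOB2.
transitions = {("O", "I-Tech"): -10.0}
decoded = viterbi_decode(emissions, transitions, tags)
print(decoded)
```

With these toy scores the transition penalty steers the second token to B-Tech, so the decoded sequence is a valid IOB2 sequence.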
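Dataset | Documents | Entities (total) | Time entities | Place entities | Ecological governance technology names
Training set | 380 | 66,223 | 13,276 | 29,333 | 23,614
Validation/development set | 127 | 23,396 | 4,135 | 11,263 | 7,998
Test set | 127 | 19,204 | 4,353 | 8,375 | 6,476
Total | 634 | 108,823 | 21,764 | 48,971 | 38,088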
Entity Statistics of Dataset
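Text set | Source | Parts used for word-vector training
 | Bibliographic records of 634 documents | Title + Keyword + Abstract
 | Bibliographic records of 634 documents + entity knowledge base corpus | Title + Keyword + Abstract + Time + Place + Tech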
Two Sets of Unlabeled Text for Word2Vec Training
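Word type | Examples
Left boundary words | 实施 (implement); 设置 (set up); 计划 (plan); 开展 (carry out); 采用 (adopt); 治理 (govern); 营造 (establish); 种植 (plant); 发展 (develop)……
Right boundary words | 技术 (technology); 模式 (model); 体系 (system); 方案 (scheme); 措施 (measure); 工程 (project); 治理 (governance); 修复 (restoration); 方法 (method); XX法 (XX method); 组合 (combination); 结合 (integration); 改良 (improvement); 利用 (utilization)……
Technology categories | 植物措施/方法/技术/模式 (plant measures/methods/technologies/models); 动物措施 (animal measures); 微生物措施 (microbial measures); 农业措施 (agricultural measures); 工程措施 (engineering measures); 化学措施 (chemical measures); 物理措施 (physical measures); 管理措施 (management measures)……
Trigger words | 沙障 (sand barrier); 沙墙 (sand wall); 沙沟 (sand ditch); 栅栏 (fence); 阻沙 (sand blocking); 草方格 (straw checkerboard); 固沙 (sand fixation); 防护林 (shelterbelt); 混交 (mixed planting); 粘合剂 (binder); 固沙剂 (sand-fixing agent); 喷播 (spray seeding); 补播 (reseeding); 草膜 (grass film); 补种 (replanting); 封禁 (enclosure); 梯田 (terrace); 造林 (afforestation); 流沙固定 (shifting sand fixation); 防沙治沙 (sand prevention and control); 水土保持 (soil and water conservation); 农林复合经营 (agroforestry); 节水集水 (water saving and harvesting); 沙产业 (sand industry); 可再生能源利用 (renewable energy utilization)……
Technology names | 草方格沙障 (straw checkerboard sand barrier); 高立式活沙障 (high vertical living sand barrier); 飞播治沙技术 (aerial seeding sand control technology); 铁路防沙治沙技术 (railway sand prevention and control technology); 划区轮牧 (zoned rotational grazing); 鱼鳞坑状反坡整地 (fish-scale pit reverse-slope land preparation); 黄土梯田 (loess terrace); 旱地农田防护林 (dryland farmland shelterbelt); 造林模式 (afforestation model); 绿洲农林间作 (oasis agroforestry intercropping); 绿洲农田防护林 (oasis farmland shelterbelt); 农田林网化 (farmland forest network); 砂田技术 (gravel-mulched field technology); 盐碱地改造 (saline-alkali land reclamation); 林草复合法 (forest-grass composite method); 乔灌混交 (tree-shrub mixed planting); 改变种植作物 (changing planted crops); 草田轮作 (grass-crop rotation); 机械阻沙 (mechanical sand blocking): 设置挡沙墙、截沙沟、阻沙栅栏、防沙网 (installing sand-retaining walls, sand-intercepting ditches, sand-blocking fences, sand-control nets); 沙障固沙 (sand fixation with barriers): 草方格沙障固沙、黏土沙障、砾石沙障、沙袋沙障 (straw checkerboard, clay, gravel, and sandbag barriers)……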
Common Words and Trigger Words of Ecological Governance Technology
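One way to turn word lists like these into features for a tagger is a plain dictionary lookup per token. The sketch below is illustrative only: `LEXICON` holds a few entries copied from the table, and the function name `lexicon_features`, the "Other" label, and the first-match rule for words that occur in more than one list (such as 治理) are assumptions, not details given in the source.

```python
# A few entries per word list, copied from the table; the real knowledge base is larger.
LEXICON = {
    "LeftWord": {"实施", "开展", "采用"},
    "RightWord": {"技术", "措施", "工程"},
    "TriggerWords": {"沙障", "防护林", "固沙剂", "草方格"},
}


def lexicon_features(tokens):
    """Label each token with the first word list that contains it, else "Other"."""
    features = []
    for token in tokens:
        category = next(
            (name for name, words in LEXICON.items() if token in words), "Other"
        )
        features.append(category)
    return features


tokens = ["采用", "草方格", "沙障", "技术"]
features = lexicon_features(tokens)
print(list(zip(tokens, features)))
```

The resulting labels can be fed to a sequence model as an extra input feature next to the word vectors, or used directly by rule patterns.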
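Type of technology name | Subtype and expression pattern | Examples
Auxiliary word lists for technology name recognition | Left boundary words (LeftWord) | 实施 (implement), 开展 (carry out), 采用 (adopt)…
Auxiliary word lists for technology name recognition | Right boundary words (RightWord) | 技术 (technology), 措施 (measure), 工程 (project)…
Auxiliary word lists for technology name recognition | Soil type (Agrotype) | 沙地 (sandy land), 草甸土 (meadow soil)…
Auxiliary word lists for technology name recognition | Ecosystem/ecological zone (ECO) | 黄土高原 (Loess Plateau), 高寒草甸 (alpine meadow)…
Auxiliary word lists for technology name recognition | Ecological degradation type (EcoDegType) | 荒漠化 (desertification), 石漠化 (rocky desertification)……
Simple technology names | Trigger words (TriggerWords) | 沙障 (sand barrier), 防护林 (shelterbelt), 固沙剂 (sand-fixing agent)…
Simple technology names | Technology category (TCategory) | 生物措施 (biological measures), 工程措施 (engineering measures)…
Technology names containing other entities | Contains a place name (Place) and a soil type (Agrotype). Pattern: (LeftWord) + Place + (Agrotype) + TriggerWords + (RightWord) |
Technology names containing other entities | Pattern: ECO + TriggerWords + (RightWord) |
Technology names containing other entities | Pattern: (EcoDegType); EcoDegType + (TriggerWords) + RightWord |
Technology name phrases | LeftWord + TriggerWords + (RightWord) |
Technology name phrases | TriggerWords + RightWord |
Technology name phrases | TCategory + RightWord |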
Rules for Named Entity Extraction of Ecological Governance Technology
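Expression patterns such as (LeftWord) + Place + (Agrotype) + TriggerWords + (RightWord) can be matched with an ordinary regular expression once every token has been reduced to a category. The sketch below encodes each category as one letter, so a position in the encoded string is also a token index. The letter codes, the function name `match_tech_names`, and the choice to leave the left boundary word out of the returned name are assumptions made for illustration.

```python
import re

# One letter per category, so a rule becomes a regular expression over letters.
CODES = {"LeftWord": "L", "Place": "P", "Agrotype": "A",
         "TriggerWords": "T", "RightWord": "R", "Other": "o"}

# (LeftWord) + Place + (Agrotype) + TriggerWords + (RightWord).
# The optional left boundary word sits outside the capture group, so it acts
# as context and is not part of the extracted name (an assumption).
RULE = re.compile(r"L?(PA?TR?)")


def match_tech_names(tokens, categories):
    """Return the technology names matched by RULE, as joined token strings."""
    encoded = "".join(CODES[c] for c in categories)
    return ["".join(tokens[m.start(1):m.end(1)]) for m in RULE.finditer(encoded)]


tokens = ["采用", "毛乌素", "沙地", "沙障", "技术"]
categories = ["LeftWord", "Place", "Agrotype", "TriggerWords", "RightWord"]
names = match_tech_names(tokens, categories)
print(names)
```

The other patterns in the table would each get their own expression over the same letter codes, for example `T R` for TriggerWords + RightWord.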
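Type of place entity | Subtype | Expression pattern | Examples
Standalone place entities | Place name without a generic term (Place) | Place | 中国 (China), 山东 (Shandong), 青海 (Qinghai)……
Standalone place entities | Place name abbreviation (PlaceAbb) | PlaceAbb | 京 (Beijing), 津 (Tianjin), 冀 (Hebei)……
Standalone place entities | Place name alias (PlaceAli) | PlaceAli | 燕京 (Yanjing), 大都 (Dadu), 首都 (the capital), 宝岛 (Treasure Island), 六朝古都 (ancient capital of six dynasties)……
Auxiliary word lists for place name recognition | Left boundary words (LeftWord) | - | 位于 (located in), 流经 (flows through), 抵达 (arrives at)……
Auxiliary word lists for place name recognition | Right boundary words (RightWord) | - | 南部 (southern part), 西岸 (west bank), 境内 (within the territory)……
Auxiliary word lists for place name recognition | Generic terms (GeneralWord) | - | 省 (province), 市 (city), 县 (county), 地区 (region), 高原 (plateau), 平原 (plain), 流域 (river basin)……
Compound place name phrases | Suffixed place names | Place + GeneralWord; Place + RightWord |
Compound place name phrases | Combined place names | n × [ Place + (GeneralWord) || PlaceAbb || PlaceAli ] | 甘肃省兰州市东岗路 (Donggang Road, Lanzhou City, Gansu Province)、
Compound place name phrases | Coordinated place names | n × [ [ PlaceAbb || PlaceAli || Place + (GeneralWord) ] + [ PlaceAbb || PlaceAli || Place + (GeneralWord) ] ] |
Compound place name phrases | Prepositional place phrases | LeftWord + Place + GeneralWord; LeftWord + Place + (GeneralWord) + RightWord; LeftWord + [ PlaceAbb || PlaceAli ] |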
Rules for Entity Extraction of Place Names
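The repetition in the combined pattern, n × [Place + (GeneralWord) || PlaceAbb || PlaceAli], amounts to merging adjacent place units into one compound name. A rough sketch of that merge follows. It assumes tokens have already been labeled with the categories from the table, and the function name `merge_place_units` is hypothetical.

```python
def merge_place_units(tokens, categories):
    """Merge runs of adjacent place units into single compound place names.

    A unit is a Place optionally followed by a GeneralWord, or a PlaceAbb,
    or a PlaceAli. Any other token ends the current run.
    """
    places, current, i = [], [], 0
    while i < len(tokens):
        category = categories[i]
        if category == "Place":
            current.append(tokens[i])
            # Absorb an optional generic term such as 省 or 市.
            if i + 1 < len(tokens) and categories[i + 1] == "GeneralWord":
                current.append(tokens[i + 1])
                i += 1
        elif category in ("PlaceAbb", "PlaceAli"):
            current.append(tokens[i])
        elif current:
            places.append("".join(current))
            current = []
        i += 1
    if current:
        places.append("".join(current))
    return places


tokens = ["位于", "甘肃", "省", "兰州", "市", "境内"]
categories = ["LeftWord", "Place", "GeneralWord", "Place", "GeneralWord", "RightWord"]
places = merge_place_units(tokens, categories)
print(places)
```

Whether boundary words such as 位于 or 境内 belong inside the extracted span is not stated in the table; this sketch leaves them out.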
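Symbol | Meaning | Symbol | Meaning
O | Not an entity | I-Place | Inside a place name
B-Time | Beginning of a time expression | B-Tech | Beginning of a technology name
I-Time | Inside a time expression | I-Tech | Inside a technology name
B-Place | Beginning of a place name | |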
Entity Classification and Annotation System
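Word in text | Part of speech | IOB2 tag
 | p | O
毛乌素 (Mu Us) | ns | B-Place
沙地 (sandy land) | n | I-Place
 | v | O
推行 (promote) | v | O
灌草 (shrub-grass) | nw | B-Tech
 | nr | I-Tech
相结合 (combined) | nz | I-Tech
 | p | O
…… | …… | ……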
Examples of Entities Annotation
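IOB2 tags are converted back into entities by opening a span at each B- tag and extending it while matching I- tags follow. The sketch below shows one way to do this. The function name `iob2_to_entities` is hypothetical, and the sample uses only those rows of the annotation example whose words are present, so the Tech span here is shorter than in the original sentence.

```python
def iob2_to_entities(tokens, tags):
    """Collect (entity_type, text) pairs from a sequence of IOB2 tags."""
    entities, current_type, current_tokens = [], None, []

    def flush():
        # Close the open entity, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_tokens:
            entities.append((current_type, "".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities


tokens = ["毛乌素", "沙地", "推行", "灌草", "相结合"]
tags = ["B-Place", "I-Place", "O", "B-Tech", "I-Tech"]
entities = iob2_to_entities(tokens, tags)
print(entities)
```

Entity-level P, R and F1 are normally computed on spans recovered this way rather than on individual tags.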
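Parameter | Setting
Word vectors | 100 dimensions, Word2Vec distributed vectors
Number of Bi-LSTM units | Num_Units: 256
Learning rate | Learning_Rate: 0.002
Gradient clipping | Clip: 10
Dropout | Dropout_Rate: 0.5; L2_Rate: 0.01 (applied to the weights of the fully connected layer)
Maximum sentence length | Sequence_Length (the sentence length distribution is in the output of Preprocessing.py)
Number of epochs | Nb_Epoch: 200 (training may stop early)
dev_size | 0.25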
Parameters of Bi-LSTM Model Experiment
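For reference, the settings above can be collected in a single configuration object, with early termination expressed as a patience check on the development score. This is only a sketch: the key `embedding_dim`, the function `should_stop`, and the `patience` value are hypothetical, and Sequence_Length is left out because the table gives no value for it.

```python
# Settings copied from the table; key names follow the table where it gives one.
CONFIG = {
    "embedding_dim": 100,    # Word2Vec vector size (key name is hypothetical)
    "Num_Units": 256,        # Bi-LSTM units
    "Learning_Rate": 0.002,
    "Clip": 10,              # gradient clipping threshold
    "Dropout_Rate": 0.5,
    "L2_Rate": 0.01,         # on the fully connected layer weights
    "Nb_Epoch": 200,         # upper bound; training may stop earlier
    "dev_size": 0.25,
}


def should_stop(dev_scores, patience=5):
    """Return True once the best dev score is at least `patience` epochs old.

    `patience` is a hypothetical setting; the source only says that training
    may terminate early.
    """
    if len(dev_scores) <= patience:
        return False
    best_epoch = max(range(len(dev_scores)), key=dev_scores.__getitem__)
    return len(dev_scores) - 1 - best_epoch >= patience


history = [0.50, 0.60, 0.59, 0.58, 0.57, 0.56, 0.55]
print(should_stop(history, patience=5))
```

A dev_size of 0.25 appears consistent with the dataset statistics reported earlier: 127 validation documents out of 380 + 127 = 507 is about 0.25.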
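Input features | Scope | P(%) | R(%) | F1(%)
Baseline | Overall | 62.66 | 21.69 | 32.22
Baseline | Time | 61.85 | 30.31 | 40.68
Baseline | Place | 61.88 | 35.40 | 45.04
Baseline | Tech | 66.10 | 8.89 | 15.68
w2v-1 | Overall | 70.23 | 56.19 | 62.43
w2v-1 | Time | 79.79 | 66.01 | 72.25
w2v-1 | Place | 75.12 | 52.18 | 61.58
w2v-1 | Tech | 63.84 | 54.96 | 59.07
w2v-2 | Overall | 70.99 | 56.41 | 62.86
w2v-2 | Time | 79.14 | 62.32 | 69.73
w2v-2 | Place | 72.05 | 53.19 | 61.20
w2v-2 | Tech | 67.26 | 56.21 | 61.24
w2v-2 + | Overall | 73.10 | 59.94 | 65.87
w2v-2 + | Time | 74.16 | 67.06 | 70.43
w2v-2 + | Place | 75.82 | 63.10 | 68.88
w2v-2 + | Tech | 65.04 | 53.78 | 58.04
w2v-2 +MoreData+ | Overall | 78.63 | 72.20 | 75.27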
Experiment Results of Different Combinations of Elements
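The F1 values in this table are the harmonic mean of P and R, which can be checked directly. The sketch below recomputes F1 from the three-value row of each configuration (taken here to be the overall score) and reports the largest gap from the published F1. The function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) from the three-value row of each configuration.
overall = {
    "Baseline": (62.66, 21.69, 32.22),
    "w2v-1": (70.23, 56.19, 62.43),
    "w2v-2": (70.99, 56.41, 62.86),
    "w2v-2 +": (73.10, 59.94, 65.87),
    "w2v-2 +MoreData+": (74.34, 64.04, 68.81),
}
max_gap = max(abs(f1_score(p, r) - f1) for p, r, f1 in overall.values())
print(f"largest gap between recomputed and reported F1: {max_gap:.3f}")
```

Small gaps are expected because the published P and R are themselves rounded to two decimals.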
The Trend of Model Training Loss
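Corpus | Entities | Time entities | Place entities | Ecological governance technology names
Training set | 39,739 | 7,965 | 17,599 | 14,175
Validation/development set | 14,037 | 2,481 | 6,757 | 4,799
Added training corpus | 35,843 | 6,965 | 16,240 | 12,638
Added entity dictionary | 7,703 | 547 | 6,693 | 463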
Entity Number of Different Training Corpus
The Influence of Increasing Training Corpus on P and F1
Approach | Model | P(%) | R(%) | F1(%)
Traditional machine learning | CRF alone | 64.93 | 64.17 | 64.55
Deep learning | Bi-LSTM+CRF | 74.34 | 64.04 | 68.81
The Results of Bi-LSTM+CRF and CRF
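The improvements quoted in the abstract follow from this table as absolute differences, that is, percentage points. A short check, which also shows the relative change for comparison, might look like this (variable names are illustrative):

```python
crf = {"P": 64.93, "R": 64.17, "F1": 64.55}
bilstm_crf = {"P": 74.34, "R": 64.04, "F1": 68.81}

# Absolute change in percentage points, as quoted in the abstract.
gains = {m: round(bilstm_crf[m] - crf[m], 2) for m in crf}
# Relative change with respect to the CRF score, in percent.
relative = {m: round(100 * (bilstm_crf[m] - crf[m]) / crf[m], 1) for m in crf}

print(gains)
print(relative)
```

The absolute gains for P and F1 match the 9.41 and 4.26 reported in the abstract, and R differs by only 0.13 points.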
