Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 78-88    DOI: 10.11925/infotech.2096-3467.2019.0034
Extracting Name Entities from Ecological Restoration Literature with Bi-LSTM+CRF
Ma Jianxia, Yuan Hui, Jiang Xiang
The Northwest Institute of Eco-Environment and Resources, Library and Information Center, Chinese Academy of Sciences, Lanzhou 730000, China
Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This study extracts named entities from ecological restoration literature, including governance technologies for fragile ecosystems, implementation sites, and implementation times. [Methods] We combined a Bi-LSTM+CRF model with a feature-based named entity knowledge base to automatically extract the required entities from CNKI documents. [Results] For the extraction of ecological governance technology entities, the P, R and F1 values were 74.34%, 64.04% and 68.81%, respectively. Compared with the classic CRF method, the new model improved the P and F1 values by 9.41 and 4.26 percentage points, while the R value was basically the same. [Limitations] The accuracy of Chinese word segmentation tools may affect the model's performance, and the relationships among entities require further study. [Conclusions] The proposed model could support fine-grained, content-based analysis of resource and environment information.

Key words: Bi-LSTM+CRF; Text Mining; Ecological Restoration Technology; Named Entity Recognition
Received: 08 January 2019      Published: 26 April 2020
Chinese Library Classification (ZTFLH): TP391
Corresponding Authors: Ma Jianxia     E-mail: majx@lzb.ac.cn

Cite this article:

Ma Jianxia,Yuan Hui,Jiang Xiang. Extracting Name Entities from Ecological Restoration Literature with Bi-LSTM+CRF. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 78-88.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0034     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/78

Research Framework
The Structure of Bi-LSTM+CRF
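In a Bi-LSTM+CRF tagger, the Bi-LSTM typically produces a score for each tag at each token, the CRF layer adds scores for tag-to-tag transitions, and the final tag sequence is commonly chosen by Viterbi decoding. The sketch below illustrates only that decoding step in plain Python. The function name `viterbi_decode`, the toy scores, and the penalty on the O → I-Tech transition are illustrative assumptions, not values from the paper.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of dicts, one per token, mapping tag -> score
    transitions: dict mapping (previous_tag, current_tag) -> score
                 (missing pairs are treated as 0.0)
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = best previous tag for `tag` at position t
    for scores in emissions[1:]:
        row, pointers = {}, {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            row[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best.append(row)
        back.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B-Tech", "I-Tech"]
# Hypothetical emission scores for three tokens.
emissions = [
    {"O": 2.0, "B-Tech": 0.0, "I-Tech": 0.0},
    {"O": 0.0, "B-Tech": 1.0, "I-Tech": 1.5},
    {"O": 0.0, "B-Tech": 0.0, "I-Tech": 2.0},
]
# Hypothetical transition score that discourages I-Tech directly after O.
transitions = {("O", "I-Tech"): -10.0}

print(viterbi_decode(emissions, transitions, tags))
```

With these toy scores, picking the best tag for each token independently would give O, I-Tech, I-Tech, which is not a valid IOB2 sequence; the transition penalty steers the decoder to O, B-Tech, I-Tech instead.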
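Dataset | Documents | Entities | Time entities | Place-name entities | Ecological governance technology names
Training set | 380 | 66,223 | 13,276 | 29,333 | 23,614
Validation/development set | 127 | 23,396 | 4,135 | 11,263 | 7,998
Test set | 127 | 19,204 | 4,353 | 8,375 | 6,476
Total | 634 | 108,823 | 21,764 | 48,971 | 38,088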
Entity Statistics of Dataset
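Text set | Source | Fields used for word-vector training
Unlabeled text set 1 | Bibliographic records of 634 documents | Title + Keyword + Abstract
Unlabeled text set 2 | Bibliographic records of 634 documents + entity knowledge base corpus | Title + Keyword + Abstract + Time + Place + Tech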
Two Sets of Unlabeled Text for Word2Vec Training
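Trigger word type | Examples
Left boundary words | 实施 (implement); 设置 (set up); 计划 (plan); 开展 (carry out); 采用 (adopt); 治理 (govern); 营造 (establish); 种植 (plant); 发展 (develop)……
Right boundary words | 技术 (technology); 模式 (mode); 体系 (system); 方案 (scheme); 措施 (measure); 工程 (engineering); 治理 (governance); 修复 (restoration); 方法 (method); XX法 (XX method); 组合 (combination); 结合 (integration); 改良 (improvement); 利用 (utilization)……
Technology categories | 植物措施/方法/技术/模式 (plant measures/methods/technologies/modes); 动物措施 (animal measures); 微生物措施 (microbial measures); 农业措施 (agricultural measures); 工程措施 (engineering measures); 化学措施 (chemical measures); 物理措施 (physical measures); 管理措施 (management measures)……
Trigger words | 沙障 (sand barrier); 沙墙 (sand wall); 沙沟 (sand ditch); 栅栏 (fence); 阻沙 (sand blocking); 草方格 (straw checkerboard); 固沙 (sand fixation); 防护林 (shelterbelt); 混交 (mixed planting); 粘合剂 (binder); 固沙剂 (sand-fixing agent); 喷播 (spray seeding); 补播 (reseeding); 草膜 (grass film); 补种 (replanting); 封禁 (enclosure); 梯田 (terrace); 造林 (afforestation); 流沙固定 (shifting-sand fixation); 防沙治沙 (sand prevention and control); 水土保持 (soil and water conservation); 农林复合经营 (agroforestry); 节水集水 (water saving and harvesting); 沙产业 (sand industry); 可再生能源利用 (renewable energy utilization)……
Technology names | 草方格沙障 (straw checkerboard sand barrier); 高立式活沙障 (high vertical living sand barrier); 飞播治沙技术 (aerial-seeding sand control technology); 铁路防沙治沙技术 (railway sand prevention and control technology); 划区轮牧 (rotational grazing); 鱼鳞坑状反坡整地 (fish-scale-pit reverse-slope land preparation); 黄土梯田 (loess terrace); 旱地农田防护林 (dryland farmland shelterbelt); 造林模式 (afforestation mode); 绿洲农林间作 (oasis agroforestry intercropping); 绿洲农田防护林 (oasis farmland shelterbelt); 农田林网化 (farmland forest network); 砂田技术 (gravel-mulched field technology); 盐碱地改造 (saline-alkali land reclamation); 林草复合法 (forest-grass composite method); 乔灌混交 (tree-shrub mixed planting); 改变种植作物 (changing planted crops); 草田轮作 (grass-crop rotation); 机械阻沙:设置挡沙墙、截沙沟、阻沙栅栏、防沙网 (mechanical sand blocking: installing sand-blocking walls, sand-intercepting ditches, sand-blocking fences, sand-control nets); 沙障固沙:草方格沙障固沙、黏土沙障、砾石沙障、沙袋沙障 (sand fixation with barriers: straw checkerboard, clay, gravel, and sandbag sand barriers)……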
Common Words and Trigger Words of Ecological Governance Technology
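Columns: technology name type; subtype and expression pattern; examples.

Word lists supporting technology name recognition
  Left boundary words (LeftWord): 实施, 开展, 采用… (implement, carry out, adopt)
  Right boundary words (RightWord): 技术, 措施, 工程… (technology, measure, engineering)
  Soil type (Agrotype): 沙地, 草甸土… (sandy land, meadow soil)
  Ecosystem/ecological zone (ECO): 黄土高原, 高寒草甸… (Loess Plateau, alpine meadow)
  Ecological degradation type (EcoDegType): 荒漠化, 石漠化…… (desertification, rocky desertification)
Simple technology names
  Trigger words (TriggerWords): 沙障, 防护林, 固沙剂… (sand barrier, shelterbelt, sand-fixing agent)
  Technology category (TCategory): 生物措施, 工程措施… (biological measures, engineering measures)
Technology names containing other entities
  Containing a place name (Place) and a soil type (Agrotype)
    Pattern: (LeftWord) + Place + (Agrotype) + TriggerWords + (RightWord)
    Example: 柴达木沙地杨树深栽造林 (deep-planting poplar afforestation on Qaidam sandy land)
  Containing an ecosystem type/ecological zone (ECO)
    Pattern: ECO + TriggerWords + (RightWord)
    Examples: 绿洲农林间作 (oasis agroforestry intercropping); 黄土高原梯田技术 (Loess Plateau terrace technology)
  Containing an ecological degradation type
    Pattern: (EcoDegType); EcoDegType + (TriggerWords) + RightWord
    Examples: 荒漠化综合治理技术 (integrated desertification control technology); 冻融荒漠化防治 (freeze-thaw desertification prevention and control)
Technology name phrases
  Pattern: LeftWord + TriggerWords + (RightWord)
    Examples: 设置生物围栏 (set up biological fences); 采用乔灌混交技术 (adopt tree-shrub mixed planting technology)
  Pattern: TriggerWords + RightWord
    Examples: 草方格沙障技术 (straw checkerboard sand barrier technology); 林草复合法 (forest-grass composite method)
  Pattern: TCategory + RightWord
    Examples: 植物/农业/工程措施 (plant/agricultural/engineering measures)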
Rules for Named Entity Extraction of Ecological Governance Technology
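As a rough illustration of how an expression pattern such as TriggerWords + (RightWord) could be applied, the sketch below compiles two tiny word lists into a regular expression and scans a sentence. The lists contain only a few of the example words given above, and the names `TRIGGER_WORDS`, `RIGHT_WORDS`, `alternation` and `find_tech_names` are hypothetical; the paper does not state how its rules were implemented.

```python
import re

# A handful of example words only; the actual knowledge base is assumed
# to be much larger.
TRIGGER_WORDS = ["草方格", "沙障", "防护林", "固沙剂"]
RIGHT_WORDS = ["技术", "措施", "工程"]


def alternation(words):
    # Try longer entries first so that a longer word is preferred
    # over a shorter one that shares its prefix.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# One or more trigger words, optionally followed by a right boundary word.
TECH_PATTERN = re.compile(
    f"(?:{alternation(TRIGGER_WORDS)})+(?:{alternation(RIGHT_WORDS)})?"
)


def find_tech_names(text):
    """Return every substring that matches TriggerWords + (RightWord)."""
    return [m.group(0) for m in TECH_PATTERN.finditer(text)]


print(find_tech_names("毛乌素沙地采用草方格沙障技术和防护林工程"))
```

A purely lexical matcher like this cannot resolve ambiguous boundaries, which is one reason such rules are usually combined with a statistical tagger rather than used alone.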
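Columns: place-name entity type; subtype; expression pattern; examples.

Stand-alone place-name entities
  Place name without a generic term (Place)
    Pattern: Place
    Examples: 中国, 山东, 青海…… (China, Shandong, Qinghai)
  Place-name abbreviation (PlaceAbb)
    Pattern: PlaceAbb
    Examples: 京, 津, 冀…… (Beijing, Tianjin, Hebei)
  Place-name alias (PlaceAli)
    Pattern: PlaceAli
    Examples: 燕京, 大都, 首都, 宝岛, 六朝古都…… (Yanjing, Dadu, the capital, the Treasure Island, the ancient capital of six dynasties)
Word lists supporting place-name recognition
  Left boundary words (LeftWord): 位于, 流经, 抵达…… (located in, flows through, arrives at)
  Right boundary words (RightWord): 南部, 西岸, 境内…… (southern part, west bank, within the territory)
  Generic terms (GeneralWord): 省, 市, 县, 地区, 高原, 平原, 流域…… (province, city, county, region, plateau, plain, river basin)
Compound place-name phrases
  Suffixed place names
    Patterns: Place + GeneralWord; Place + RightWord
    Examples: 西北地区 (Northwest region); 兰州市南部 (southern Lanzhou City)
  Combined place names
    Pattern: n × [ Place + (GeneralWord) || PlaceAbb || PlaceAli ]
    Examples: 甘肃省兰州市东岗路 (Donggang Road, Lanzhou City, Gansu Province); 中国甘肃兰州 (Lanzhou, Gansu, China)
  Coordinate place names
    Pattern: n × [ [ PlaceAbb || PlaceAli || Place + (GeneralWord) ] + “-” or “、” or “<” or “>” or “和” or “与” + [ PlaceAbb || PlaceAli || Place + (GeneralWord) ] ]
    Examples: 银川-环县-西安 (Yinchuan-Huanxian-Xi'an); 陕>宁>青>甘>新 (Shaanxi > Ningxia > Qinghai > Gansu > Xinjiang)
  Place-name prepositional phrases
    Patterns: LeftWord + Place + GeneralWord; LeftWord + Place + (GeneralWord) + RightWord; LeftWord + [ PlaceAbb || PlaceAli ]
    Examples: 途径兰州市 (passing through Lanzhou City); 抵达兰州市境内 (arriving within Lanzhou City); 位于六朝古都 (located in the ancient capital of six dynasties)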
Rules for Entity Extraction of Place Names
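The coordinate place-name pattern above (several place units joined by separators) can be approximated with a regular expression. The following is a minimal sketch under stated assumptions: the gazetteer `PLACES`, the reduced `GENERAL_WORDS` and `SEPARATORS` lists, and the function `find_coordinate_places` are hypothetical, and abbreviations and aliases (PlaceAbb, PlaceAli) are left out for brevity.

```python
import re

# Small illustrative lists; real gazetteers are assumed to be far larger.
PLACES = ["银川", "环县", "西安", "兰州", "甘肃"]
GENERAL_WORDS = ["省", "市", "县", "地区"]
SEPARATORS = ["-", "、", "和", "与"]


def alternation(words):
    # Longer entries first, so longer words are tried before shorter ones.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# One unit: Place + (GeneralWord)
UNIT = f"(?:{alternation(PLACES)})(?:{alternation(GENERAL_WORDS)})?"
# Coordinate place name: a unit followed by one or more (separator + unit)
COORDINATE = re.compile(f"{UNIT}(?:(?:{alternation(SEPARATORS)}){UNIT})+")


def find_coordinate_places(text):
    """Return substrings in which two or more place units are joined by separators."""
    return [m.group(0) for m in COORDINATE.finditer(text)]


print(find_coordinate_places("线路途经银川-环县-西安,并连接甘肃省和兰州市"))
```

Because each unit must come from the gazetteer, a single place name with no separator is not reported by this pattern; it would be covered by the stand-alone and suffixed rules instead.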
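Tag | Meaning
O | Non-entity
B-Time | Beginning of a time entity
I-Time | Inside a time entity
B-Place | Beginning of a place-name entity
I-Place | Inside a place-name entity
B-Tech | Beginning of a technology entity
I-Tech | Inside a technology entity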
Entity Classification and Annotation System
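Word in text | Part of speech | IOB2 tag
[…] | p | O
毛乌素 (Mu Us) | ns | B-Place
沙地 (sandy land) | n | I-Place
[…] | v | O
推行 (promote) | v | O
灌草 (shrub-grass) | nw | B-Tech
[…] | nr | I-Tech
相结合 (combined) | nz | I-Tech
[…] | p | O
…… | …… | ……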
Examples of Entities Annotation
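To make the IOB2 scheme concrete, the sketch below converts a tagged token sequence back into entity spans. The function name `iob2_to_spans` is hypothetical, and the sample sentence is a shortened version of the annotation example that uses only the words legible in the table, so it is an illustration rather than a reproduction of the paper's data.

```python
def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []

    def close():
        # Store the entity collected so far, if any.
        if current_tokens:
            spans.append((current_type, "".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the current entity.
            close()
            current_type, current_tokens = None, []
    close()
    return spans


tokens = ["毛乌素", "沙地", "推行", "灌草", "相结合"]
tags = ["B-Place", "I-Place", "O", "B-Tech", "I-Tech"]
print(iob2_to_spans(tokens, tags))
```

Tokens are joined without spaces because the text is Chinese; for space-delimited languages the join character would need to change.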
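Parameter | Setting
Word vectors | 100-dimensional Word2Vec distributed vectors
Number of Bi-LSTM units | Num_Units: 256
Learning rate | Learning_Rate: 0.002
Gradient clipping | Clip: 10
Dropout | Dropout_Rate: 0.5; L2_Rate: 0.01 (applied to the weights of the fully connected layer)
Maximum sentence length | Sequence_Length (the sentence length distribution appears in the output of Preprocessing.py)
Number of epochs | Nb_Epoch: 200 (early stopping allowed)
dev_size | 0.25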
Parameters of Bi-LSTM Model Experiment
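One way to keep these settings together in code is a small immutable configuration object. The sketch below is only an illustration: the class name `BiLstmCrfConfig` and the field names are hypothetical (lower-cased versions of the parameter names in the table), and the maximum sentence length is omitted because the table gives no value for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiLstmCrfConfig:
    embedding_dim: int = 100       # 100-dimensional Word2Vec vectors
    num_units: int = 256           # Num_Units: Bi-LSTM units
    learning_rate: float = 0.002   # Learning_Rate
    clip: float = 10.0             # Clip: gradient clipping threshold
    dropout_rate: float = 0.5      # Dropout_Rate
    l2_rate: float = 0.01          # L2_Rate, on the fully connected layer weights
    nb_epoch: int = 200            # Nb_Epoch: upper bound, training may stop early
    dev_size: float = 0.25         # dev_size
    # Sequence_Length is not set here: the table reports no concrete value.


config = BiLstmCrfConfig()
print(config)
```

Freezing the dataclass prevents accidental changes to hyperparameters in the middle of an experiment run.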
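Input features | Entity type | P (%) | R (%) | F1 (%)
Baseline | All | 62.66 | 21.69 | 32.22
Baseline | Time | 61.85 | 30.31 | 40.68
Baseline | Place | 61.88 | 35.40 | 45.04
Baseline | Tech | 66.10 | 8.89 | 15.68
w2v-1 | All | 70.23 | 56.19 | 62.43
w2v-1 | Time | 79.79 | 66.01 | 72.25
w2v-1 | Place | 75.12 | 52.18 | 61.58
w2v-1 | Tech | 63.84 | 54.96 | 59.07
w2v-2 | All | 70.99 | 56.41 | 62.86
w2v-2 | Time | 79.14 | 62.32 | 69.73
w2v-2 | Place | 72.05 | 53.19 | 61.20
w2v-2 | Tech | 67.26 | 56.21 | 61.24
w2v-2 + MoreData | All | 73.10 | 59.94 | 65.87
w2v-2 + MoreData | Time | 74.16 | 67.06 | 70.43
w2v-2 + MoreData | Place | 75.82 | 63.10 | 68.88
w2v-2 + MoreData | Tech | 65.04 | 53.78 | 58.04
w2v-2 + MoreData + dictionaries | All | 74.34 | 64.04 | 68.81
w2v-2 + MoreData + dictionaries | Time | 78.63 | 72.20 | 75.27
w2v-2 + MoreData + dictionaries | Place | 82.02 | 65.71 | 72.97
w2v-2 + MoreData + dictionaries | Tech | 66.02 | 58.75 | 62.17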
Experiment Results of Different Combinations of Elements
The Trend of Model Training Loss
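Corpus | Entities | Time entities | Place-name entities | Ecological governance technology names
Training set | 39,739 | 7,965 | 17,599 | 14,175
Validation/development set | 14,037 | 2,481 | 6,757 | 4,799
Added training corpus | 35,843 | 6,965 | 16,240 | 12,638
Added entity dictionary | 7,703 | 547 | 6,693 | 463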
Entity Number of Different Training Corpus
The Influence of Increasing Training Corpus on P and F1
Approach | Model | P (%) | R (%) | F1 (%)
Traditional machine learning | CRF alone | 64.93 | 64.17 | 64.55
Deep learning | Bi-LSTM+CRF | 74.34 | 64.04 | 68.81
The Results of Bi-LSTM+CRF and CRF
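The reported scores can be cross-checked with the standard definition of F1 as the harmonic mean of precision and recall. The short sketch below recomputes F1 for both models from the P and R values in the comparison above and derives the differences quoted in the abstract; the function name `f1_score` and the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# P and R (in %) as reported for the two models.
crf_p, crf_r = 64.93, 64.17
bilstm_p, bilstm_r = 74.34, 64.04

crf_f1 = round(f1_score(crf_p, crf_r), 2)
bilstm_f1 = round(f1_score(bilstm_p, bilstm_r), 2)

print("CRF F1:", crf_f1)
print("Bi-LSTM+CRF F1:", bilstm_f1)
# Differences are in percentage points.
print("P gain:", round(bilstm_p - crf_p, 2))
print("F1 gain:", round(bilstm_f1 - crf_f1, 2))
```

The recomputed values agree with the table (64.55 and 68.81), and the gains of 9.41 and 4.26 are absolute differences in percentage points rather than relative improvements.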