Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (9): 100-112    DOI: 10.11925/infotech.2096-3467.2021.1414
Entity Recognition and Labeling for Medical Literature Based on Neural Network
Zhao Ruijie, Tong Xinyu, Liu Xiaohua, Lu Yonghe
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract  

[Objective] This paper proposes a new entity recognition model that aims to surface new knowledge from medical papers and improve their utilization. [Methods] We constructed a pharmaceutical entity recognition model based on Attention-BiLSTM-CRF and evaluated it on two public datasets, the GENIA Term Annotation Task (GENIA) and BioCreative II Gene Mention Tagging (BCII-GM). We also used the model to annotate abstracts of biomedical papers. [Results] On GENIA and BCII-GM, the model reached F1 scores of 81.57% and 84.23% and accuracy of 92.51% and 97.85%, respectively. These results are competitive with, and in most cases better than, the benchmark results. The model's advantage is more pronounced on extremely imbalanced data. [Limitations] The entity labeling experiments cover a limited volume of data and a narrow range of applications. [Conclusions] The proposed model improves the effectiveness of entity recognition and of mining new medical knowledge.

Key words: Biomedical Named Entity Recognition; Entity Annotation; Neural Network; Attention Mechanism
Received: 15 December 2021      Published: 26 October 2022
ZTFLH:  G350  
Fund: Science and Technology Program of Guangzhou, China (202002020036)
Corresponding Author: Lu Yonghe, ORCID: 0000-0002-7758-9365, E-mail: luyonghe@mail.sysu.edu.cn

Cite this article:

Zhao Ruijie, Tong Xinyu, Liu Xiaohua, Lu Yonghe. Entity Recognition and Labeling for Medical Literature Based on Neural Network. Data Analysis and Knowledge Discovery, 2022, 6(9): 100-112.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1414     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I9/100

Medical Entity Recognition Model Framework Based on Attention-BiLSTM-CRF
Split        Tag   GENIA count   GENIA share   BCII-GM count   BCII-GM share
Training     B     27,110        9.29%         8,012           2.08%
Training     I     20,089        6.89%         6,294           1.63%
Training     E     27,110        9.29%         8,012           2.08%
Training     S     15,958        5.47%         10,633          2.76%
Training     O     201,456       69.06%        352,224         91.45%
Validation   B     8,972         9.19%         1,321           2.03%
Validation   I     6,967         7.14%         1,061           1.63%
Validation   E     8,972         9.19%         1,321           2.03%
Validation   S     5,008         5.13%         1,889           2.91%
Validation   O     67,696        69.35%        59,432          91.40%
Test         B     9,475         9.76%         1,394           2.14%
Test         I     6,935         7.14%         1,035           1.59%
Test         E     9,475         9.76%         1,394           2.14%
Test         S     4,684         4.82%         1,882           2.88%
Test         O     66,522        68.52%        59,544          91.26%
Number of Tags in GENIA and BCII-GM
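The tag names in the table above appear to follow the BIOES scheme (B = first token of a multi-token entity, I = interior token, E = last token, S = single-token entity, O = outside any entity). The source does not spell this out, but the identical B and E counts in every split are consistent with it. A minimal Python sketch of that encoding, assuming non-overlapping entity spans with inclusive end indices; the function name and the example sentence are illustrative and not taken from the paper:

```python
def spans_to_bioes(n_tokens, spans):
    """Convert entity spans to BIOES tags.

    spans: list of (start, end) token indices, end inclusive, non-overlapping.
    """
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"              # single-token entity
        else:
            tags[start] = "B"              # first token of the entity
            tags[end] = "E"                # last token of the entity
            for i in range(start + 1, end):
                tags[i] = "I"              # interior tokens
    return tags


# Hypothetical sentence with one three-token entity and one single-token entity.
tokens = ["The", "human", "leukocyte", "antigen", "binds", "RUNX3"]
tags = spans_to_bioes(len(tokens), [(1, 3), (5, 5)])
print(list(zip(tokens, tags)))
```

Under this encoding every multi-token entity contributes exactly one B and one E, which would explain why the two rows match in each split.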
Split        GENIA sentences   GENIA words   BCII-GM sentences   BCII-GM words
Training     11,127            15,294        14,975              36,063
Validation   3,709             18,081        2,500               39,827
Test         3,710             20,555        2,500               43,304
Number of Sentences and Words in GENIA and BCII-GM
Model        Parameter values
BiLSTM       Hidden units: 100
CNN          Kernel heights: 2, 4, 5; kernels per height: 50; fully connected units: 100
BiLSTM-CNN   BiLSTM hidden units: 100; CNN kernel height: 4; kernels: 100
Parameter Setting of Character Feature Extraction
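To make the CNN row concrete: with kernel heights 2, 4, and 5 and 50 kernels per height, a character-level CNN yields 3 x 50 = 150 features per word before the 100-unit fully connected layer. The sketch below illustrates only that dimension bookkeeping. It assumes max-over-time pooling, zero padding for short words, an 8-dimensional character embedding, and random untrained weights; none of these details are stated in the source, and all names are hypothetical.

```python
import random

KERNEL_HEIGHTS = (2, 4, 5)   # from the table
KERNELS_PER_HEIGHT = 50      # from the table
CHAR_DIM = 8                 # hypothetical character-embedding size


def make_kernels(rng):
    """Random (untrained) kernels: for each height, a list of height x CHAR_DIM matrices."""
    return {
        h: [[[rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for _ in range(h)]
            for _ in range(KERNELS_PER_HEIGHT)]
        for h in KERNEL_HEIGHTS
    }


def pad(char_vectors, min_len):
    """Zero-pad short words so that every kernel fits at least once."""
    missing = max(0, min_len - len(char_vectors))
    return char_vectors + [[0.0] * CHAR_DIM for _ in range(missing)]


def conv_max(char_vectors, kernel):
    """Slide one kernel over the characters and keep the largest activation."""
    height = len(kernel)
    activations = []
    for start in range(len(char_vectors) - height + 1):
        window = char_vectors[start:start + height]
        activations.append(sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        ))
    return max(activations)


def char_cnn_features(char_vectors, kernels):
    """Concatenate the pooled output of every kernel of every height."""
    padded = pad(char_vectors, max(kernels))
    return [conv_max(padded, k) for h in sorted(kernels) for k in kernels[h]]


rng = random.Random(0)
kernels = make_kernels(rng)
word = [[rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for _ in "BAP1"]  # 4 characters
features = char_cnn_features(word, kernels)
print(len(features))  # 3 heights x 50 kernels = 150
```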
Model                  Parameter values
Attention-BiLSTM-CRF   Encoder units: 100; decoder units: 200; fully connected units: 100
BiLSTM-CRF             Hidden units: 100
Deep CNN-CRF           Kernel height: 5; kernels: 100; network layers: 6
Parameter Setting of Sentence Level Feature Extraction
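Parameter                     Attention-BiLSTM-CRF / BiLSTM-CRF / Deep CNN-CRF   BERT
Batch size                    70                                                 10
Maximum training epochs       80                                                 5
Dropout probability           0.5                                                0.1
Learning rate                 0.01                                               0.0001
Learning-rate warm-up steps   -                                                  500
Learning-rate decay rate      0.9                                                -
Optimizer                     Adam                                               Adam
Early-stopping parameter      10                                                 -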
Parameter Settings
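The table gives a learning-rate decay rate of 0.9 and an early-stopping parameter of 10 for the non-BERT models, but not how either is applied. The sketch below shows one common reading, labeled as an assumption: the learning rate is multiplied by 0.9 once per epoch, and training stops after 10 consecutive epochs without improvement on the validation set (a patience of 10). The function names and the validation curve are hypothetical.

```python
def decayed_lr(epoch, base_lr=0.01, decay=0.9):
    """Learning rate after `epoch` decay steps (per-epoch decay is an assumption)."""
    return base_lr * decay ** epoch


def early_stop(val_scores, max_epochs=80, patience=10):
    """Return (last_epoch_run, best_epoch) for a list of per-epoch validation scores.

    Training stops once `patience` consecutive epochs fail to beat the best score,
    or when `max_epochs` is reached.
    """
    best_score, best_epoch = float("-inf"), -1
    last_epoch = -1
    for epoch, score in enumerate(val_scores[:max_epochs]):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return last_epoch, best_epoch


# Hypothetical validation F1 curve: improves for three epochs, then plateaus.
scores = [0.50, 0.60, 0.70] + [0.65] * 20
last_epoch, best_epoch = early_stop(scores)
print(last_epoch, best_epoch)
print([round(decayed_lr(e), 5) for e in range(3)])
```

With this curve the best score occurs at epoch 2, so the rule stops at epoch 12, well before the 80-epoch ceiling.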
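Tag sequence                     Strategy
…,B,I,…,I,B(S),…                 Treat [B,I,…,I] as an entity labeled by the model
…,B,I,…,I,O,…                    Treat [B,I,…,I] as an entity labeled by the model
…,O,B(I\E),O,…                   Treat [B(I\E)] as an entity labeled by the model
…,O,I(E),I,…,I,E(I),O,…          Treat [I(E),I,…,I,E(I)] as an entity labeled by the model
…,S(E),I(E),I,…,I,E(I),O,…       Treat [I(E),I,…,I,E(I)] as an entity labeled by the model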
Entity Extraction Strategies for Special Tag Sequences
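The five strategies above can be read as one left-to-right decoding rule, sketched here in Python. This is an interpretation, not the authors' code. It assumes that a notation such as B(S) means "B or S", that S always marks a single-token entity, and that an entity opens at B or at an I/E with no entity open, continues through I tags, and closes at the first E (inclusive) or just before any other tag. Sequences the table does not list, such as two adjacent orphan E tags, are handled by that same rule purely by assumption.

```python
def extract_entities(tags):
    """Return (start, end) spans, end inclusive, from a possibly malformed BIOES sequence."""
    spans = []
    i, n = 0, len(tags)
    while i < n:
        tag = tags[i]
        if tag == "O":
            i += 1
        elif tag == "S":
            spans.append((i, i))           # single-token entity
            i += 1
        else:
            # B, or an orphan I/E with no open entity, opens a span.
            start = i
            i += 1
            while i < n and tags[i] == "I":
                i += 1                     # absorb interior tokens
            if i < n and tags[i] == "E":
                i += 1                     # include the closing E when present
            spans.append((start, i - 1))
    return spans


# One example per row of the table, in order.
print(extract_entities(["O", "B", "I", "I", "B", "E"]))
print(extract_entities(["B", "I", "I", "O"]))
print(extract_entities(["O", "E", "O"]))
print(extract_entities(["O", "I", "I", "I", "E", "O"]))
print(extract_entities(["S", "I", "I", "E", "O"]))
```

A well-formed sequence such as O, B, I, E, O, S decodes to the expected spans under the same rule, so the repair logic does not need a separate code path.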
                                       GENIA                                    BCII-GM
Model                                  Precision  Recall  F1      Accuracy      Precision  Recall  F1      Accuracy
char-BiLSTM+Attention-BiLSTM-CRF       81.73%     81.42%  81.57%  92.51%        84.40%     84.07%  84.23%  97.85%
char-CNN+Attention-BiLSTM-CRF          81.79%     78.18%  79.95%  91.96%        81.22%     77.87%  79.51%  97.15%
char-BiLSTM-CNN+Attention-BiLSTM-CRF   80.69%     82.20%  81.44%  92.43%        83.44%     83.49%  83.46%  97.72%
Test Results of Character Feature Vector Model
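The F1 values in this table are consistent with the usual definition of F1 as the harmonic mean of precision and recall. A short Python check, using only the precision and recall reported for char-BiLSTM+Attention-BiLSTM-CRF; the function name is illustrative:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of char-BiLSTM+Attention-BiLSTM-CRF, taken from the table.
genia_f1 = f1_score(81.73, 81.42)
bcii_f1 = f1_score(84.40, 84.07)
print(round(genia_f1, 2), round(bcii_f1, 2))
```

Both values round to the reported 81.57% and 84.23%. Other rows can differ in the last digit because the published precision and recall are themselves rounded.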
                                   GENIA               BCII-GM
Model                              F1       Accuracy   F1       Accuracy
char-BiLSTM+BiLSTM-CRF             81.55%   92.53%     83.68%   97.81%
char-BiLSTM+Deep CNN-CRF           70.77%   87.85%     59.81%   94.85%
char-BiLSTM+Attention-BiLSTM-CRF   81.57%   92.51%     84.23%   97.85%
BERT-CRF                           84.45%   91.99%     81.99%   97.56%
KoBioLM                            -        -          85.10%   -
Triaffine+BioBERT                  81.23%   -          -        -
Test Results of This Model and Benchmark Model on GENIA and BCII-GM
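When reading the accuracy columns, it helps to know how far a trivial tagger would get. Assuming accuracy here is computed per tag (the source does not define it), a tagger that labels every token O would already score the share of O tags in the test set, while extracting no entities at all. The sketch below computes that baseline from the test-set tag counts reported in the tag table; the variable and function names are illustrative.

```python
# Test-set tag counts as reported in the tag-count table.
test_counts = {
    "GENIA":   {"B": 9475, "I": 6935, "E": 9475, "S": 4684, "O": 66522},
    "BCII-GM": {"B": 1394, "I": 1035, "E": 1394, "S": 1882, "O": 59544},
}


def all_o_accuracy(counts):
    """Tag-level accuracy (in percent) of a tagger that predicts O everywhere."""
    return 100 * counts["O"] / sum(counts.values())


baseline = {name: all_o_accuracy(c) for name, c in test_counts.items()}
for name, acc in baseline.items():
    print(f"{name}: all-O baseline accuracy = {acc:.2f}%")
```

The baseline is roughly 68.5% on GENIA and 91.3% on BCII-GM, which is why accuracy differences between models are much smaller on BCII-GM and why F1 is the more informative column on such imbalanced data.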
Model G                                           Model B
Entity                                Frequency   Entity                           Frequency
RUNX3 (RUNX3 gene)                    1           KRAS mutant                      7
DNA methylation                       63          BAP1 (BRCA-associated protein)   4
human leukocyte antigen               13          Myc (Myc oncogene family)        4
mtDNA genome (mitochondrial genome)   1           β2AR (β-adrenergic receptor)     2
rheumatoid arthritis                  16          ABCG2 (ABCG2 gene)               1
Candidate Pharmaceutical Entities Extracted by Model G and Model B