Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue 9: 100-112     https://doi.org/10.11925/infotech.2096-3467.2021.1414
Research Paper
Entity Recognition and Labeling for Medical Literature Based on Neural Network
Zhao Ruijie,Tong Xinyu,Liu Xiaohua,Lu Yonghe()
School of Information Management, Sun Yat-Sen University, Guangzhou 510006, China
Abstract

[Objective] This paper proposes a new entity recognition model, aiming to improve medical entity recognition, mine new medical knowledge, and increase the utilization of medical papers. [Methods] We constructed a medical entity recognition model based on Attention-BiLSTM-CRF and evaluated it on two public datasets, the GENIA Term Annotation Task and BioCreative II Gene Mention Tagging. We then used the model to annotate entities in the abstracts of biomedical papers. [Results] The F1 scores of our model on the two datasets were 81.57% and 84.23%, and the accuracy rates were 92.51% and 97.85%, outperforming the benchmark models. The model also shows a clearer advantage on highly unbalanced data. [Limitations] The entity labeling experiments are limited in data volume and application scope. [Conclusions] The proposed Attention-BiLSTM-CRF model improves entity recognition and supports the mining of new medical knowledge.

Key words: Biomedical Named Entity Recognition; Entity Annotation; Neural Network; Attention Mechanism
Received: 2021-12-15      Published online: 2022-10-26
CLC Number: G350
Funding: Guangzhou Science and Technology Program (202002020036)
Corresponding author: Lu Yonghe, ORCID: 0000-0002-7758-9365, E-mail: luyonghe@mail.sysu.edu.cn
Cite this article:
Zhao Ruijie, Tong Xinyu, Liu Xiaohua, Lu Yonghe. Entity Recognition and Labeling for Medical Literature Based on Neural Network. Data Analysis and Knowledge Discovery, 2022, 6(9): 100-112.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1414      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I9/100
Fig. 1  Framework of the Attention-BiLSTM-CRF-based medical entity recognition model
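The figure itself is not reproduced on this page. As a rough, illustrative sketch of the kind of architecture the caption describes (token embeddings fed through a BiLSTM encoder, an attention layer, and a CRF output layer), the PyTorch code below shows one possible layout. The layer sizes, the single-head self-attention (standing in for whatever attention variant the paper actually uses), and the third-party torchcrf package are assumptions of this sketch, not the authors' implementation.

# Illustrative sketch only; NOT the authors' code.
# Requires: pip install torch pytorch-crf
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party CRF layer (assumed here)

class AttentionBiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        # single-head dot-product self-attention over the BiLSTM outputs
        # (a stand-in for the paper's attention mechanism)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_tags)   # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # transition scores + Viterbi decoding

    def emissions(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))
        context, _ = self.attn(h, h, h)             # attend over the whole sentence
        return self.fc(context)

    def loss(self, token_ids, tags, mask):
        # negative log-likelihood of the gold tag sequence under the CRF
        return -self.crf(self.emissions(token_ids), tags, mask=mask, reduction='mean')

    def decode(self, token_ids, mask):
        return self.crf.decode(self.emissions(token_ids), mask=mask)

In a setup like this, training would minimize loss() with Adam (cf. Table 5), and decode() would produce the tag sequences that are scored in Tables 7 and 8.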
Split            Label   GENIA count   GENIA share   BCII-GM count   BCII-GM share
Training set     B       27 110        9.29%         8 012           2.08%
                 I       20 089        6.89%         6 294           1.63%
                 E       27 110        9.29%         8 012           2.08%
                 S       15 958        5.47%         10 633          2.76%
                 O       201 456       69.06%        352 224         91.45%
Validation set   B       8 972         9.19%         1 321           2.03%
                 I       6 967         7.14%         1 061           1.63%
                 E       8 972         9.19%         1 321           2.03%
                 S       5 008         5.13%         1 889           2.91%
                 O       67 696        69.35%        59 432          91.40%
Test set         B       9 475         9.76%         1 394           2.14%
                 I       6 935         7.14%         1 035           1.59%
                 E       9 475         9.76%         1 394           2.14%
                 S       4 684         4.82%         1 882           2.88%
                 O       66 522        68.52%        59 544          91.26%
Table 1  Label counts in the GENIA and BCII-GM datasets
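The B/I/E/S/O labels counted in Table 1 correspond to a BIOES tagging scheme (Begin, Inside, End, Single-token, Outside). As a generic illustration of how token-level BIOES tags are derived from entity spans, the helper below is a sketch under that assumption; it is not claimed to be the preprocessing used in the paper.

def spans_to_bioes(num_tokens, spans):
    """Convert entity spans [(start, end_exclusive), ...] over a tokenized
    sentence into BIOES tags. Illustrative only; the paper's preprocessing
    may differ in details such as overlap handling."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags

# e.g. "human leukocyte antigen expression" with one 3-token entity
print(spans_to_bioes(4, [(0, 3)]))   # ['B', 'I', 'E', 'O']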
Split            GENIA sentences   GENIA words   BCII-GM sentences   BCII-GM words
Training set     11 127            15 294        14 975              36 063
Validation set   3 709             18 081        2 500               39 827
Test set         3 710             20 555        2 500               43 304
Table 2  Numbers of sentences and words in the GENIA and BCII-GM datasets
Model        Parameter settings
BiLSTM       100 hidden-layer neurons
CNN          Kernel heights 2, 4 and 5; 50 kernels per height; 100 fully connected neurons
BiLSTM-CNN   100 BiLSTM hidden-layer neurons; CNN kernel height 4; 100 kernels
Table 3  Parameter settings for the character feature extraction experiments
Model                  Parameter settings
Attention-BiLSTM-CRF   100 encoder neurons; 200 decoder neurons; 100 fully connected neurons
BiLSTM-CRF             100 hidden-layer neurons
Deep CNN-CRF           Kernel height 5; 100 kernels; 6 network layers
Table 4  Parameter settings for the sentence-level feature extraction experiments
Parameter                     Attention-BiLSTM-CRF / BiLSTM-CRF / Deep CNN-CRF   BERT
Batch size                    70                                                 10
Maximum training epochs       80                                                 5
Dropout probability           0.5                                                0.1
Learning rate                 0.01                                               0.0001
Learning-rate warm-up steps   -                                                  500
Learning-rate decay rate      0.9                                                -
Optimizer                     Adam                                               Adam
Early-stopping patience       10                                                 -
Table 5  Parameter settings of the experimental models
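For reference, the settings in Table 5 can be restated as plain configuration mappings; the dictionaries below merely repeat the table, and the key names are my own ('-' entries are given as None).

# Hyperparameters from Table 5, restated as dictionaries (illustrative key names).
BILSTM_FAMILY_CONFIG = {   # Attention-BiLSTM-CRF / BiLSTM-CRF / Deep CNN-CRF
    "batch_size": 70,
    "max_epochs": 80,
    "dropout": 0.5,
    "learning_rate": 0.01,
    "lr_warmup_steps": None,
    "lr_decay_rate": 0.9,
    "optimizer": "Adam",
    "early_stopping_patience": 10,
}
BERT_CONFIG = {
    "batch_size": 10,
    "max_epochs": 5,
    "dropout": 0.1,
    "learning_rate": 1e-4,
    "lr_warmup_steps": 500,
    "lr_decay_rate": None,
    "optimizer": "Adam",
    "early_stopping_patience": None,
}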
Label sequence                       Strategy
…, B, I, …, I, B(S), …               Treat [B, I, …, I] as a model-labeled entity
…, B, I, …, I, O, …                  Treat [B, I, …, I] as a model-labeled entity
…, O, B(I\E), O, …                   Treat [B(I\E)] as a model-labeled entity
…, O, I(E), I, …, I, E(I), O, …      Treat [I(E), I, …, I, E(I)] as a model-labeled entity
…, S(E), I(E), I, …, I, E(I), O, …   Treat [I(E), I, …, I, E(I)] as a model-labeled entity
Table 6  Entity extraction strategies for some special label sequences
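Table 6 describes how entities are still recovered when the predicted tag sequence is not well-formed BIOES. The sketch below implements one lenient decoder in that spirit: any maximal run of non-O tags counts as an entity even if it lacks a proper B or E, and a B or S always starts a new entity. It reproduces the behaviour listed in the table for the cases shown, but it is not claimed to be the authors' exact rule set.

def extract_entities(tags):
    """Leniently extract (start, end_exclusive) entity spans from a list of
    BIOES tag strings. Illustrative sketch in the spirit of Table 6."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):      # sentinel O flushes the last run
        if tag in ("B", "S", "O"):
            if start is not None:               # close the running entity
                spans.append((start, i))
                start = None
            if tag == "S":
                spans.append((i, i + 1))        # single-token entity
            elif tag == "B":
                start = i                       # B starts a new entity
        else:                                   # tag is I or E
            if start is None:
                start = i                       # run beginning with I/E (rows 4-5)
    return spans

print(extract_entities(["O", "B", "I", "I", "O"]))   # [(1, 4)]  (row 2)
print(extract_entities(["O", "I", "I", "E", "O"]))   # [(1, 4)]  (row 4)
print(extract_entities(["O", "B", "O"]))             # [(1, 2)]  (row 3)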
Model                                  GENIA                                     BCII-GM
                                       Precision  Recall   F1       Accuracy     Precision  Recall   F1       Accuracy
char-BiLSTM+Attention-BiLSTM-CRF       81.73%     81.42%   81.57%   92.51%       84.40%     84.07%   84.23%   97.85%
char-CNN+Attention-BiLSTM-CRF          81.79%     78.18%   79.95%   91.96%       81.22%     77.87%   79.51%   97.15%
char-BiLSTM-CNN+Attention-BiLSTM-CRF   80.69%     82.20%   81.44%   92.43%       83.44%     83.49%   83.46%   97.72%
Table 7  Test results of the character feature vector models on the two datasets
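The precision, recall, and F1 values in Tables 7 and 8 are entity-level scores, while accuracy is computed over individual tokens. The sketch below shows one common way to compute such scores from gold and predicted entity spans; whether the paper scores exact-match spans in precisely this way is an assumption.

def entity_prf(gold_spans, pred_spans):
    """Exact-match entity-level precision, recall and F1.
    gold_spans / pred_spans: iterables of span collections, one per sentence."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def token_accuracy(gold_tags, pred_tags):
    """Token-level accuracy over parallel lists of tag sequences."""
    pairs = [(g, p) for gs, ps in zip(gold_tags, pred_tags) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)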
Model                              GENIA                 BCII-GM
                                   F1       Accuracy     F1       Accuracy
char-BiLSTM+BiLSTM-CRF             81.55%   92.53%       83.68%   97.81%
char-BiLSTM+Deep CNN-CRF           70.77%   87.85%       59.81%   94.85%
char-BiLSTM+Attention-BiLSTM-CRF   81.57%   92.51%       84.23%   97.85%
BERT-CRF                           84.45%   91.99%       81.99%   97.56%
KoBioLM                            -        -            85.10%   -
Triaffine+BioBERT                  81.23%   -            -        -
Table 8  Test results of the proposed model and benchmark models on the GENIA and BCII-GM datasets
Model G                                           Model B
Entity                                Frequency   Entity                                Frequency
RUNX3 (RUNX3 gene)                    1           KRAS mutant                           7
DNA methylation                       63          BAP1 (BRCA1-associated protein 1)     4
human leukocyte antigen               13          Myc (Myc oncogene)                    4
mtDNA genome (mitochondrial genome)   1           β2AR (β2-adrenergic receptor)         2
rheumatoid arthritis                  16          ABCG2 (ABCG2 gene)                    1
Table 9  Some candidate medical entities extracted by Model G and Model B
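Table 9 lists candidate entities together with how often each model extracted them from the annotated abstracts. A trivial way to produce such frequency lists from per-abstract entity annotations is sketched below; the input data here are hypothetical.

from collections import Counter

# hypothetical: each item is the list of entity strings a model extracted
# from one abstract
annotations = [
    ["DNA methylation", "RUNX3", "DNA methylation"],
    ["rheumatoid arthritis", "human leukocyte antigen"],
]

freq = Counter(e for abstract in annotations for e in abstract)
print(freq.most_common(3))
# [('DNA methylation', 2), ('RUNX3', 1), ('rheumatoid arthritis', 1)]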