Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 142-154    DOI: 10.11925/infotech.2096-3467.2022.0348
Medical Named Entity Recognition with Domain Knowledge
Pei Wei1,2,3, Sun Shuifa1,2,3, Li Xiaolong2,4, Lu Ji2, Yang Liu5, Wu Yirong6
1Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydropower Engineering, China Three Gorges University, Yichang 443002, China
2Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
3College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
4College of Economics & Management, China Three Gorges University, Yichang 443002, China
5Faculty of Psychology, Beijing Normal University, Zhuhai 519087, China
6Institute of Advanced Studies in Humanities and Social Sciences, Beijing Normal University, Zhuhai 519087, China
Abstract  

[Objective] This paper builds a graph neural network model that integrates medical domain knowledge (GraphModel-Dict) to identify named entities in medical texts. [Methods] First, we used a graph neural network structure to integrate domain knowledge, mapping the raw text and the domain dictionaries to nodes of different categories. We then updated the text nodes with a Gated Recurrent Unit (GRU) to obtain semantic representations enriched with domain knowledge. Next, we fed the text node representations into a Bidirectional Long Short-Term Memory network (BiLSTM) and used a Conditional Random Field (CRF) model to predict the labels and generate the recognition results. Finally, we evaluated GraphModel-Dict on two datasets. [Results] We examined GraphModel-Dict on a manually annotated dataset of 3,100 Chinese ultrasound examination reports on breast cancer. The model's precision, recall, and F1-score for entity recognition reached 96.91%, 97.52%, and 97.22%, respectively. Furthermore, GraphModel-Dict showed better recognition performance for entity types with fewer samples or more diverse expressions. On the CCKS2020 medical dataset, the F1-score of GraphModel-Dict was at least 1.39 percentage points higher than that of the baseline models. [Limitations] More research is needed to examine the effectiveness of the proposed model in other fields. [Conclusions] Integrating domain knowledge can improve the effectiveness of named entity recognition, which benefits medical information mining and clinical research.
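
The graph construction step described in the abstract can be sketched as follows. This page does not give the construction details, so the sketch rests on assumptions: dictionary entries are matched by exact substring search, every character becomes a text node, every match becomes a dictionary node linked to the characters it covers, and edges are undirected. The function name `build_graph`, the node tuples, and the example sentence are hypothetical. The GRU update, BiLSTM, and CRF stages are not shown.

```python
from collections import defaultdict


def build_graph(text, dictionary):
    """Link character nodes to the dictionary-entry nodes that cover them.

    Assumption: exact substring matching and undirected edges; the paper's
    actual graph construction may differ.
    """
    char_nodes = [("char", i) for i in range(len(text))]
    dict_nodes = []
    edges = defaultdict(set)

    # Connect adjacent characters so the text order is kept in the graph.
    for i in range(len(text) - 1):
        edges[("char", i)].add(("char", i + 1))
        edges[("char", i + 1)].add(("char", i))

    # Add one dictionary node per match and link it to the covered characters.
    for label, words in dictionary.items():
        for word in words:
            if not word:
                continue  # skip empty entries
            start = text.find(word)
            while start != -1:
                node = ("dict", label, start, start + len(word))
                dict_nodes.append(node)
                for i in range(start, start + len(word)):
                    edges[node].add(("char", i))
                    edges[("char", i)].add(node)
                start = text.find(word, start + 1)

    return char_nodes, dict_nodes, dict(edges)


# Hypothetical report fragment and a two-entry dictionary.
text = "左侧乳腺低回声"
dictionary = {"Location": ["左侧乳腺"], "Echo": ["低回声"]}
chars, matches, edges = build_graph(text, dictionary)
print(matches)
```

In a design of this kind, each character node would gather information from its matched dictionary nodes during the update step, which is how entity-type hints could reach characters of rare entity types.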

Key words: Medical Named Entity Recognition; Graph Neural Network; Domain Knowledge Dictionary; Breast Cancer Ultrasound Examination Reports
Received: 15 April 2022      Published: 09 November 2022
ZTFLH:  TP391  
Fund: National Social Science Fund Project (20BTQ066)
Corresponding Author: Wu Yirong, ORCID: 0000-0003-3535-2033, E-mail: yirongwu@hotmail.com

Cite this article:

Pei Wei, Sun Shuifa, Li Xiaolong, Lu Ji, Yang Liu, Wu Yirong. Medical Named Entity Recognition with Domain Knowledge. Data Analysis and Knowledge Discovery, 2023, 7(3): 142-154.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0348     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I3/142

Architecture of GraphModel-Dict
Annotation Results
Example of BIO Annotation Format
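
As a brief illustration of the BIO annotation format named in the caption above: each character is tagged `B-<type>` at the start of an entity, `I-<type>` inside it, and `O` outside any entity. The sketch below assumes character-level tagging and half-open `(start, end)` offsets. The helper `to_bio` and the example sentence are hypothetical and are not taken from the annotated corpus.

```python
def to_bio(text, entities):
    """Convert (start, end, label) spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tags


# Hypothetical fragment: a Location entity followed by an Echo entity.
text = "左侧乳腺低回声"
entities = [(0, 4, "Location"), (4, 7, "Echo")]
for char, tag in zip(text, to_bio(text, entities)):
    print(char, tag)
```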
Tier | Entity type | Training set | Test set | Validation set | Total
1 | Location | 15,594 | 1,716 | 1,576 | 18,886
1 | Echo | 8,022 | 871 | 760 | 9,653
1 | Negation | 5,229 | 597 | 542 | 6,368
1 | Size | 4,849 | 520 | 439 | 5,808
2 | Vascularity | 3,598 | 401 | 346 | 4,345
2 | Masses | 3,516 | 404 | 383 | 4,303
2 | Margin | 3,346 | 370 | 325 | 4,041
2 | DuctChanges | 2,631 | 287 | 267 | 3,185
2 | LymphNode | 1,550 | 173 | 119 | 1,842
2 | Shape | 694 | 56 | 77 | 827
3 | ArchitecturalDistortion | 35 | 5 | 11 | 51
3 | PosteriorFeatures | 39 | 4 | 3 | 46
3 | TissueComposition | 22 | 2 | 2 | 26
3 | Calcifications | 18 | 3 | 3 | 24
3 | Skin | 16 | 1 | 1 | 18
All | Total | 49,159 | 5,410 | 4,854 | 59,423
Statistics of BI-RADS Entity Types and Their Annotation Corpus
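Entity type | Training set | Test set | Validation set | Total
Anatomical site | 6,951 | 941 | 895 | 8,787
Disease and diagnosis | 3,473 | 439 | 420 | 4,332
Drug | 1,553 | 173 | 201 | 1,927
Laboratory test | 1,006 | 150 | 131 | 1,287
Imaging examination | 807 | 93 | 102 | 1,002
Surgery | 741 | 88 | 89 | 918
Total | 14,531 | 1,884 | 1,838 | 18,253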
Statistics of CCKS2020 Medical Dataset Entity Types and Their Annotation Corpus
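Number of dictionary vocabulary matches per data split:
Data source | Training set | Test set | Validation set
Breast cancer ultrasound examination report dataset | 90,747 | 9,961 | 9,206
CCKS2020 medical dataset | 52,629 | 6,522 | 6,727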
Domain Knowledge Dictionary Vocabulary Matching Statistics
Parameter | Value
Hidden layer dimension | 300
Batch size | 32
Training epochs | 100
Dropout rate | 0.5
Learning rate | 0.001
char-embedding-dim | 50
dictionary-embedding-dim | 50
Optimizer | Adam
Experimental Parameter Setting
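
For readers who want to reuse these settings, the parameter table can be captured in a small configuration object. This is only a sketch: the values come from the table above, but the class name `TrainConfig` and the field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters copied from the parameter table; names are illustrative."""
    hidden_dim: int = 300
    batch_size: int = 32
    epochs: int = 100
    dropout: float = 0.5
    learning_rate: float = 0.001
    char_embedding_dim: int = 50
    dictionary_embedding_dim: int = 50
    optimizer: str = "Adam"


config = TrainConfig()
print(config)
```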
Model | Precision/% | Recall/% | F1/%
CRF | 96.56 | 97.15 | 96.86
BiLSTM | 95.89 | 97.23 | 96.55
BiLSTM-CRF | 96.62 | 97.48 | 97.05
BERT-BiLSTM-CRF | 96.69 | 97.51 | 97.10
GraphModel | 96.79 | 97.43 | 97.11
GraphModel-Dict (this paper) | 96.91 | 97.52 | 97.22
Comparison of Experimental Results
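
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded precision and recall in the table above. Because those inputs are already rounded to two decimals, the recomputed values can differ from the reported F1 in the last digit. The function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "CRF": (96.56, 97.15, 96.86),
    "BiLSTM": (95.89, 97.23, 96.55),
    "BiLSTM-CRF": (96.62, 97.48, 97.05),
    "BERT-BiLSTM-CRF": (96.69, 97.51, 97.10),
    "GraphModel": (96.79, 97.43, 97.11),
    "GraphModel-Dict": (96.91, 97.52, 97.22),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: recomputed {f1_score(p, r):.2f}, reported {f1:.2f}")
```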
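All values are F1/%.
Tier | Entity type | CRF | BiLSTM | BiLSTM-CRF | BERT-BiLSTM-CRF | GraphModel | GraphModel-Dict
1 | Location | 97.50 | 97.10 | 97.85 | 97.62 | 97.56 | 97.65
1 | Echo | 96.04 | 95.82 | 97.27 | 97.41 | 96.39 | 96.62
1 | Negation | 99.75 | 99.75 | 99.62 | 99.94 | 99.75 | 99.75
1 | Size | 99.71 | 99.90 | 100.00 | 99.87 | 100.00 | 100.00
2 | Vascularity | 91.77 | 91.60 | 90.82 | 89.67 | 92.02 | 92.77
2 | Masses | 92.82 | 92.79 | 92.06 | 92.20 | 92.59 | 92.82
2 | Margin | 98.79 | 98.25 | 98.29 | 98.80 | 98.79 | 98.92
2 | DuctChanges | 99.30 | 98.62 | 98.94 | 99.47 | 99.48 | 99.13
2 | LymphNode | 93.52 | 93.64 | 89.51 | 93.62 | 93.79 | 94.35
2 | Shape | 95.73 | 95.73 | 94.51 | 93.48 | 96.55 | 96.55
3 | ArchitecturalDistortion | 60.00 | 33.33 | 60.00 | 60.00 | 54.55 | 80.00
3 | PosteriorFeatures | 66.67 | 66.67 | 80.00 | 80.00 | 80.00 | 80.00
3 | TissueComposition | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 50.00
3 | Calcifications | 66.67 | 66.67 | 33.33 | 33.33 | 0.00 | 66.67
3 | Skin | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00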
Performance of Entity Type in Breast Cancer Ultrasound Dataset
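
One way to summarize the tier 3 rows (the entity types with the fewest samples) is an unweighted mean of the per-type F1 for each model. This mean is derived here from the table above and is not a figure reported on this page. With only one to five test entities per tier 3 type, it should be read as a rough indication rather than a stable estimate.

```python
from statistics import mean

models = ["CRF", "BiLSTM", "BiLSTM-CRF", "BERT-BiLSTM-CRF",
          "GraphModel", "GraphModel-Dict"]

# Tier 3 F1 (%) per entity type, in the column order of `models`.
tier3 = {
    "ArchitecturalDistortion": [60.00, 33.33, 60.00, 60.00, 54.55, 80.00],
    "PosteriorFeatures": [66.67, 66.67, 80.00, 80.00, 80.00, 80.00],
    "TissueComposition": [0.00, 0.00, 0.00, 0.00, 0.00, 50.00],
    "Calcifications": [66.67, 66.67, 33.33, 33.33, 0.00, 66.67],
    "Skin": [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
}

# Unweighted mean over the five tier 3 entity types for each model.
tier3_mean = {
    model: mean(row[i] for row in tier3.values())
    for i, model in enumerate(models)
}

for model, value in tier3_mean.items():
    print(f"{model}: {value:.2f}")
```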
Distribution of the Number of Entities and Representation Styles
Model | Precision/% | Recall/% | F1/%
CRF | 81.97 | 78.93 | 80.42
BERT-BiLSTM-CRF | 77.91 | 79.54 | 78.72
GraphModel-Dict (this paper) | 81.26 | 82.38 | 81.81
Experimental Results of CCKS2020 Medical Dataset
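
The F1 gap between GraphModel-Dict and each baseline on CCKS2020 follows directly from the table above. The short check below computes those differences in percentage points. The smallest gap, against CRF, corresponds to the "at least 1.39" figure quoted in the abstract.

```python
# F1 (%) on the CCKS2020 medical dataset, copied from the table above.
f1 = {"CRF": 80.42, "BERT-BiLSTM-CRF": 78.72, "GraphModel-Dict": 81.81}

# Gain of GraphModel-Dict over each baseline, in percentage points.
gains = {
    model: round(f1["GraphModel-Dict"] - score, 2)
    for model, score in f1.items()
    if model != "GraphModel-Dict"
}

for model, gain in gains.items():
    print(f"vs {model}: +{gain:.2f} points")
print(f"smallest gain: {min(gains.values()):.2f} points")
```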
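All values are F1/%.
Entity type | CRF | BERT-BiLSTM-CRF | GraphModel-Dict
Anatomical site | 79.93 | 77.27 | 81.83
Disease and diagnosis | 77.23 | 75.50 | 77.06
Drug | 90.09 | 85.71 | 90.59
Laboratory test | 79.93 | 77.27 | 82.12
Imaging examination | 83.80 | 83.33 | 87.50
Surgery | 80.90 | 78.74 | 76.14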
Performance of Entity Type in CCKS2020 Medical Dataset
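Error type | Gold annotation | Incorrect prediction
Modifier loss | Vascularity: 异常彩流信号 | Vascularity: 彩流信号
Modifier loss | Anatomical site: 双侧腋上 | Anatomical site: 双侧腋
Inconsistent annotation | Location: 左侧乳腺12点钟方向距乳头1.2 cm | Location: 左侧乳腺12点钟方向
Inconsistent annotation | Location: 右侧乳腺5点钟方向 | Location: 右侧乳腺5点钟方向距皮下0.4cm距乳头3.0cm处
Very few samples | Skin: 皮质增厚 | None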
Examples of Errors