Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (3): 142-154     https://doi.org/10.11925/infotech.2096-3467.2022.0348
Research Article
Medical Named Entity Recognition with Domain Knowledge
Pei Wei1,2,3, Sun Shuifa1,2,3, Li Xiaolong2,4, Lu Ji2, Yang Liu5, Wu Yirong6
1 Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydropower Engineering, China Three Gorges University, Yichang 443002, China
2 Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
3 College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
4 College of Economics & Management, China Three Gorges University, Yichang 443002, China
5 Faculty of Psychology, Beijing Normal University, Zhuhai 519087, China
6 Institute of Advanced Studies in Humanities and Social Sciences, Beijing Normal University, Zhuhai 519087, China
Abstract

[Objective] This paper builds GraphModel-Dict, a graph neural network model that integrates medical domain knowledge, to recognize named entities in medical texts. [Methods] First, we integrated domain knowledge through a graph structure: the raw text and the domain dictionary entries form nodes of different categories, and a Gated Recurrent Unit (GRU) updates the nodes so that each text node obtains a semantic representation enriched with domain knowledge. Second, we fed the final representations of the text nodes into a Bidirectional Long Short-Term Memory (BiLSTM) network. Third, a Conditional Random Field (CRF) layer predicted the labels and output the recognized sequence. Finally, we evaluated the model on two datasets. [Results] On a manually annotated dataset of 3,100 Chinese breast cancer ultrasound examination reports, GraphModel-Dict reached a precision of 96.91%, a recall of 97.52%, and an F1-score of 97.22%. In the per-type evaluation, it showed better recognition performance on entity types with few samples or diverse expressions. On the CCKS2020 medical dataset, its F1-score was at least 1.39 percentage points higher than those of the baseline models. [Limitations] The experiments covered only medical datasets; the effectiveness of the model in other domains needs further study. [Conclusions] Using domain knowledge effectively improves named entity recognition and can support medical information mining and clinical research.

Key words: Medical Named Entity Recognition; Graph Neural Network; Domain Knowledge Dictionary; Breast Cancer Ultrasound Examination Reports
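The abstract says that text nodes are updated with a GRU over a graph that also contains dictionary nodes, but this page does not reproduce the update equations. As rough orientation only, one common gated graph update (in the style of gated graph neural networks, and not necessarily the authors' exact formulation) can be written as follows. Here $h_v^{(t)}$ is the state of node $v$ at step $t$ and $\mathcal{N}(v)$ is its set of neighbours; the weight matrices $W$, $U$ and biases $b$ are assumed notation introduced for this sketch.

```latex
\begin{aligned}
a_v^{(t)} &= \sum_{u \in \mathcal{N}(v)} W_e\, h_u^{(t-1)} \\
z_v^{(t)} &= \sigma\!\left(W_z a_v^{(t)} + U_z h_v^{(t-1)} + b_z\right) \\
r_v^{(t)} &= \sigma\!\left(W_r a_v^{(t)} + U_r h_v^{(t-1)} + b_r\right) \\
\tilde{h}_v^{(t)} &= \tanh\!\left(W_h a_v^{(t)} + U_h \left(r_v^{(t)} \odot h_v^{(t-1)}\right) + b_h\right) \\
h_v^{(t)} &= \left(1 - z_v^{(t)}\right) \odot h_v^{(t-1)} + z_v^{(t)} \odot \tilde{h}_v^{(t)}
\end{aligned}
```

Under this reading, the final states of the text nodes would form the sequence passed to the BiLSTM and then to the CRF layer.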
Received: 2022-04-15      Published: 2022-11-09
CLC number: TP391
Funding: National Social Science Fund of China (20BTQ066)
Corresponding author: Wu Yirong, ORCID: 0000-0003-3535-2033, E-mail: yirongwu@hotmail.com
Cite this article:
Pei Wei, Sun Shuifa, Li Xiaolong, Lu Ji, Yang Liu, Wu Yirong. Medical Named Entity Recognition with Domain Knowledge[J]. Data Analysis and Knowledge Discovery, 2023, 7(3): 142-154.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0348      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I3/142
Fig.1  Overall architecture of the GraphModel-Dict model
Fig.2  Annotation results
Fig.3  Example of the BIO annotation format
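Fig.3 itself is not reproduced on this page. As a stand-in, the sketch below shows how character-level entity spans are conventionally turned into BIO tags for Chinese text. The sentence, the span offsets, and the helper name `spans_to_bio` are illustrative assumptions and are not taken from the paper; only the label names come from Table 1.

```python
from typing import List, Tuple

# (start, end, label) with end exclusive; offsets index characters,
# which is the usual unit for Chinese NER.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-level entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags


# Hypothetical sentence, not copied from the paper's figure.
text = "右侧乳腺可见低回声"
spans = [(0, 4, "Location"), (6, 9, "Echo")]
tags = spans_to_bio(text, spans)
for char, tag in zip(text, tags):
    print(char, tag)
```

Characters outside any entity keep the tag `O`, so the tag sequence always has the same length as the input text.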
Tier  Entity type               Training    Test  Validation   Total
1     Location                    15,594   1,716       1,576  18,886
1     Echo                         8,022     871         760   9,653
1     Negation                     5,229     597         542   6,368
1     Size                         4,849     520         439   5,808
2     Vascularity                  3,598     401         346   4,345
2     Masses                       3,516     404         383   4,303
2     Margin                       3,346     370         325   4,041
2     DuctChanges                  2,631     287         267   3,185
2     LymphNode                    1,550     173         119   1,842
2     Shape                          694      56          77     827
3     ArchitecturalDistortion         35       5          11      51
3     PosteriorFeatures               39       4           3      46
3     TissueComposition               22       2           2      26
3     Calcifications                  18       3           3      24
3     Skin                            16       1           1      18
      Total                       49,159   5,410       4,854  59,423
Table 1  BI-RADS entity types and annotated corpus statistics
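Entity type              Training   Test  Validation   Total
Anatomical site             6,951    941         895   8,787
Disease and diagnosis       3,473    439         420   4,332
Drug                        1,553    173         201   1,927
Laboratory test             1,006    150         131   1,287
Imaging examination           807     93         102   1,002
Surgery                       741     88          89     918
Total                      14,531  1,884       1,838  18,253
Table 2  Entity types and annotated corpus statistics of the CCKS2020 medical dataset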
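Data source                                 Number of matched dictionary terms
                                            Training    Test  Validation
Breast cancer ultrasound report dataset       90,747   9,961       9,206
CCKS2020 medical dataset                      52,629   6,522       6,727
Table 3  Number of domain-dictionary term matches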
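This page does not describe how the matches in Table 3 were found or how they enter the graph. One plausible reading, sketched below under that assumption, is to scan each sentence for every substring that appears in the domain dictionary, create one node per match, and link it to the characters it covers. The function names, the node naming scheme, the `max_len` cut-off, and the toy dictionary are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def match_dictionary(text: str, dictionary: Set[str], max_len: int = 8) -> List[Tuple[int, int, str]]:
    """Return every (start, end, word) such that text[start:end] is a dictionary word."""
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in dictionary:
                matches.append((start, end, word))
    return matches


def build_graph(text: str, dictionary: Set[str]) -> Dict[str, list]:
    """Build character nodes, dictionary-word nodes, and the edges linking them."""
    char_nodes = [f"c{i}:{ch}" for i, ch in enumerate(text)]
    # Edges between adjacent characters preserve the sentence order.
    edges = [(char_nodes[i], char_nodes[i + 1]) for i in range(len(text) - 1)]
    word_nodes = []
    for start, end, word in match_dictionary(text, dictionary):
        node = f"w{start}-{end}:{word}"
        word_nodes.append(node)
        # Link the dictionary node to every character it covers.
        edges.extend((node, char_nodes[i]) for i in range(start, end))
    return {"char_nodes": char_nodes, "word_nodes": word_nodes, "edges": edges}


toy_dictionary = {"乳腺", "低回声", "回声"}  # hypothetical entries
graph = build_graph("右侧乳腺可见低回声", toy_dictionary)
print(len(graph["word_nodes"]), "dictionary matches")
```

Overlapping matches such as 低回声 and 回声 are both kept here, which would let the model weigh competing dictionary evidence instead of committing to one segmentation up front.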
Parameter                   Value
Hidden layer dimension      300
Batch size                  32
Training epochs             100
Dropout rate                0.5
Learning rate               0.001
char-embedding-dim          50
dictionary-embedding-dim    50
Optimizer                   Adam
Table 4  Experimental parameter settings
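For anyone re-running a comparable experiment, the settings in Table 4 can be kept in one immutable configuration object. The values below are copied from the table; the class name and field names are illustrative choices, not identifiers from the authors' code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from Table 4; the field names are illustrative."""
    hidden_dim: int = 300
    batch_size: int = 32
    epochs: int = 100
    dropout: float = 0.5
    learning_rate: float = 0.001
    char_embedding_dim: int = 50
    dictionary_embedding_dim: int = 50
    optimizer: str = "Adam"


config = ExperimentConfig()
for name, value in asdict(config).items():
    print(f"{name} = {value}")
```

Freezing the dataclass prevents a setting from being changed by accident halfway through a run.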
Model                          Precision/%  Recall/%  F1/%
CRF                                  96.56     97.15  96.86
BiLSTM                               95.89     97.23  96.55
BiLSTM-CRF                           96.62     97.48  97.05
BERT-BiLSTM-CRF                      96.69     97.51  97.10
GraphModel                           96.79     97.43  97.11
GraphModel-Dict (this paper)         96.91     97.52  97.22
Table 5  Comparison of experimental results
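The F1 column in Table 5 is the harmonic mean of precision and recall. The short check below recomputes it from the published precision and recall values. Because those inputs are already rounded to two decimals, the recomputed F1 can differ from the reported one in the last digit; the function name is an illustrative choice.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 5.
table5 = {
    "CRF": (96.56, 97.15, 96.86),
    "BiLSTM": (95.89, 97.23, 96.55),
    "BiLSTM-CRF": (96.62, 97.48, 97.05),
    "BERT-BiLSTM-CRF": (96.69, 97.51, 97.10),
    "GraphModel": (96.79, 97.43, 97.11),
    "GraphModel-Dict": (96.91, 97.52, 97.22),
}

for model, (p, r, reported) in table5.items():
    print(f"{model}: recomputed {f1_score(p, r):.2f}, reported {reported:.2f}")
```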
Tier  Entity type                 CRF  BiLSTM  BiLSTM-CRF  BERT-BiLSTM-CRF  GraphModel  GraphModel-Dict
1     Location                  97.50   97.10       97.85            97.62       97.56            97.65
1     Echo                      96.04   95.82       97.27            97.41       96.39            96.62
1     Negation                  99.75   99.75       99.62            99.94       99.75            99.75
1     Size                      99.71   99.90      100.00            99.87      100.00           100.00
2     Vascularity               91.77   91.60       90.82            89.67       92.02            92.77
2     Masses                    92.82   92.79       92.06            92.20       92.59            92.82
2     Margin                    98.79   98.25       98.29            98.80       98.79            98.92
2     DuctChanges               99.30   98.62       98.94            99.47       99.48            99.13
2     LymphNode                 93.52   93.64       89.51            93.62       93.79            94.35
2     Shape                     95.73   95.73       94.51            93.48       96.55            96.55
3     ArchitecturalDistortion   60.00   33.33       60.00            60.00       54.55            80.00
3     PosteriorFeatures         66.67   66.67       80.00            80.00       80.00            80.00
3     TissueComposition          0.00    0.00        0.00             0.00        0.00            50.00
3     Calcifications            66.67   66.67       33.33            33.33        0.00            66.67
3     Skin                       0.00    0.00        0.00             0.00        0.00             0.00
Table 6  Per-entity-type F1 (%) on the breast cancer ultrasound dataset
Fig.4  Distribution of entity counts and numbers of expression variants
Model                          Precision/%  Recall/%  F1/%
CRF                                  81.97     78.93  80.42
BERT-BiLSTM-CRF                      77.91     79.54  78.72
GraphModel-Dict (this paper)         81.26     82.38  81.81
Table 7  Experimental results on the CCKS2020 medical dataset
Entity type                CRF  BERT-BiLSTM-CRF  GraphModel-Dict
Anatomical site          79.93            77.27            81.83
Disease and diagnosis    77.23            75.50            77.06
Drug                     90.09            85.71            90.59
Laboratory test          79.93            77.27            82.12
Imaging examination      83.80            83.33            87.50
Surgery                  80.90            78.74            76.14
Table 8  Per-entity-type F1 (%) on the CCKS2020 medical dataset
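Error type               Gold annotation                                  Incorrect prediction
Modifier lost            Vascularity: 异常彩流信号                          Vascularity: 彩流信号
Modifier lost            Anatomical site: 双侧腋上                          Anatomical site: 双侧腋
Inconsistent annotation  Location: 左侧乳腺12点钟方向距乳头1.2 cm            Location: 左侧乳腺12点钟方向
Inconsistent annotation  Location: 右侧乳腺5点钟方向                         Location: 右侧乳腺5点钟方向距皮下0.4cm距乳头3.0cm处
Very few samples         Skin: 皮质增厚                                    None
Table 9  Error examples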
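The examples in Table 9 count as errors even when most of the entity text was found. That is consistent with strict exact-match scoring, the usual convention for NER, although this page does not state the scoring rule explicitly. The sketch below assumes exact matching on (label, text) pairs and mixes two error rows from Table 9 with one hypothetical correct prediction; real evaluations normally compare character offsets as well.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (label, surface text)


def exact_match_prf(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 where only identical pairs count."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = {
    ("Vascularity", "异常彩流信号"),   # from Table 9: modifier lost in the prediction
    ("Skin", "皮质增厚"),             # from Table 9: entity missed entirely
    ("Location", "右侧乳腺5点钟方向"),  # assumed to be predicted correctly here
}
predicted = {
    ("Vascularity", "彩流信号"),       # truncated span, so no credit under exact match
    ("Location", "右侧乳腺5点钟方向"),
}

precision, recall, f1 = exact_match_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Under this rule a truncated span is penalized twice: it is a false positive for the prediction and a false negative for the gold entity.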