Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 251-262     https://doi.org/10.11925/infotech.2096-3467.2021.0910
Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF
Zhang Fangcong1, Qin Qiuli1, Jiang Yong2, Zhuang Runtao3
1School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China
2National Clinical Medical Research Center for Nervous System Diseases, Beijing Tiantan Hospital Affiliated to Capital Medical University, Beijing 100050, China
3Community Health Service Center, Beijing Jiaotong University, Beijing 100044, China
Abstract

[Objective] This study addresses two problems in named entity recognition (NER) for Chinese electronic medical records (EMR): polysemy and incomplete recognition of words. [Methods] We adopt the deep learning model RoBERTa-WWM-BiLSTM-CRF to improve NER on Chinese EMR, and run four groups of comparative experiments to analyze how different models affect recognition performance. [Results] The proposed model reached an overall F1 score of 0.8908. [Limitations] The dataset is small, and recognition performance for some departments is mediocre; for example, the F1 score for the respiratory department is only 0.8111. [Conclusions] The experiments indicate that the RoBERTa-WWM-BiLSTM-CRF model is better suited to NER on Chinese EMR and effectively addresses polysemy and incomplete word recognition in this task.

Key words: Named Entity Recognition; Deep Learning; Electronic Medical Records
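
The CRF layer named in the model title selects a tag sequence jointly instead of choosing each character's tag in isolation. The following is a minimal sketch of Viterbi decoding, the standard dynamic-programming procedure for that selection. It is not the authors' implementation: the tag names (`O`, `B-DIS`, `I-DIS`), the emission scores (standing in for BiLSTM outputs), and the transition scores are all illustrative assumptions.

```python
from typing import Dict, List


def viterbi_decode(emissions: List[Dict[str, float]],
                   transitions: Dict[str, Dict[str, float]],
                   tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag]      : score of `tag` at position t (assumed encoder output)
    transitions[prev][tag] : score of moving from `prev` to `tag`
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][tag])
            scores[tag] = best[-1][prev] + transitions[prev][tag] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)
    # Start from the best final tag and follow the back-pointers.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical three-character example: taken position by position, the
# highest emissions give O, I-DIS, O, which is an invalid label sequence.
tags = ["O", "B-DIS", "I-DIS"]
emissions = [
    {"O": 1.0, "B-DIS": 0.9, "I-DIS": 0.0},
    {"O": 0.2, "B-DIS": 0.1, "I-DIS": 1.0},
    {"O": 1.0, "B-DIS": 0.0, "I-DIS": 0.1},
]
# Assumed transition scores: only O -> I-DIS is penalized.
transitions = {prev: {tag: 0.0 for tag in tags} for prev in tags}
transitions["O"]["I-DIS"] = -10.0

print(viterbi_decode(emissions, transitions, tags))
```

In this toy case the penalty on `O -> I-DIS` makes the decoder prefer starting the entity with `B-DIS`, which illustrates how sequence-level decoding can reduce incompletely recognized entities.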
Received: 2021-08-25      Published: 2022-04-14
CLC number: TP393
Corresponding author: Qin Qiuli, ORCID: 0000-0002-3787-8488, E-mail: qlqin@bjtu.edu.cn
Cite this article:
Zhang Fangcong, Qin Qiuli, Jiang Yong, Zhuang Runtao. Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 251-262.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0910      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I2/3/251
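Fig.1  Overall structure of the RoBERTa-WWM-BiLSTM-CRF model
Fig.2  Structure of the RoBERTa-WWM model
Fig.3  Schematic of the BERT model
Fig.4  Schematic of the RoBERTa model
Fig.5  Structure of an LSTM unit
Fig.6  Experimental workflow for entity recognition in Chinese EMR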
Hyperparameter            Value
Dropout                   0.5
Epoch                     30
Batch_size                64
LSTM hidden dimension     768
Maximum sequence length   512
Learning rate             0.0001
Table 1  Annotation scheme
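
As one way to carry the hyperparameter values listed above into training code, they can be collected in a single immutable configuration object. This is only a sketch: the class name `TrainingConfig` and its field names are hypothetical, and only the values themselves come from the table.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameter values from the table; all names are illustrative."""
    dropout: float = 0.5
    epochs: int = 30
    batch_size: int = 64
    lstm_hidden_dim: int = 768   # LSTM hidden dimension
    max_seq_length: int = 512    # maximum sequence length
    learning_rate: float = 1e-4  # 0.0001


config = TrainingConfig()
print(asdict(config))
```

Freezing the dataclass prevents accidental changes to the settings during a run, which helps keep experiments comparable.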
Experiment                       Model                   Metric  Disease  Symptom  Body part  Examination  Treatment  Overall
Different pre-trained models     BiLSTM-CRF              P       0.8101   0.8337   0.8217     0.8597       0.7852     0.8456
                                                         R       0.7850   0.8254   0.8104     0.8513       0.7612     0.8171
                                                         F1      0.7974   0.8295   0.8160     0.8554       0.7730     0.8311
                                 BERT-BiLSTM-CRF         P       0.8217   0.9011   0.8354     0.8603       0.7956     0.8541
                                                         R       0.8026   0.9102   0.8203     0.8624       0.8037     0.8268
                                                         F1      0.8120   0.9056   0.8278     0.8613       0.7996     0.8402
Proposed model                   RoBERTa-WWM-BiLSTM-CRF  P       0.8354   0.9427   0.8789     0.9118       0.8250     0.8908
                                                         R       0.8992   0.9203   0.8445     0.8572       0.7981     0.8907
                                                         F1      0.8661   0.9313   0.8614     0.8837       0.8113     0.8908
Different downstream structures  RoBERTa-WWM-CRF         P       0.8287   0.9184   0.8413     0.8904       0.8026     0.8493
                                                         R       0.8036   0.9054   0.8261     0.8302       0.7629     0.8421
                                                         F1      0.8159   0.9118   0.8336     0.8592       0.7822     0.8457
                                 RoBERTa-WWM-LSTM-CRF    P       0.8299   0.9207   0.8507     0.9015       0.8126     0.8521
                                                         R       0.8307   0.9135   0.8324     0.8429       0.7853     0.8695
                                                         F1      0.8303   0.9171   0.8415     0.8712       0.7987     0.8607
Table 2  Experimental results
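
The P, R, and F1 rows in the results table can be cross-checked with the usual definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The page does not state the formula, so its use here is an assumption. The sketch below recomputes the overall F1 for each model from the reported P and R; the helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) from the "Overall" column of the table.
overall = {
    "BiLSTM-CRF": (0.8456, 0.8171, 0.8311),
    "BERT-BiLSTM-CRF": (0.8541, 0.8268, 0.8402),
    "RoBERTa-WWM-BiLSTM-CRF": (0.8908, 0.8907, 0.8908),
    "RoBERTa-WWM-CRF": (0.8493, 0.8421, 0.8457),
    "RoBERTa-WWM-LSTM-CRF": (0.8521, 0.8695, 0.8607),
}

for model, (p, r, reported) in overall.items():
    computed = f1_score(p, r)
    print(f"{model}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```

Because the published P and R are themselves rounded to four decimals, a recomputed value may differ from the reported one in the last digit.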
Fig.7  Experimental results for different annotation schemes
Metric  Urology  Neurology  General surgery  Cardiology  Orthopedics  Respiratory  Oncology
P       0.8141   0.8517     0.8529           0.8989      0.8542       0.8116       0.8956
R       0.8119   0.8473     0.8528           0.8973      0.8537       0.8105       0.8861
F1      0.8135   0.8495     0.8528           0.8980      0.8545       0.8111       0.8908
Table 3  Entity recognition results by department