Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 251-262    DOI: 10.11925/infotech.2096-3467.2021.0910
Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF
Zhang Fangcong1, Qin Qiuli1, Jiang Yong2, Zhuang Runtao3
1School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China
2National Clinical Medical Research Center for Nervous System Diseases, Beijing Tiantan Hospital Affiliated to Capital Medical University, Beijing 100050, China
3Community Health Service Center, Beijing Jiaotong University, Beijing 100044, China
Abstract  

[Objective] This study addresses the problems of polysemy and incomplete words in entity recognition for Chinese Electronic Medical Records (EMR). [Methods] We constructed a deep learning model, RoBERTa-WWM-BiLSTM-CRF, to improve named entity recognition for Chinese EMR, and conducted four rounds of experiments to compare how different settings affect entity recognition. [Results] The highest F1 value of the new model reached 0.8908. [Limitations] The experimental data set is small, and the entity recognition results for some departments were not very impressive; for example, the F1 value for the respiratory department was only 0.8111. [Conclusions] The RoBERTa-WWM-BiLSTM-CRF model can effectively perform named entity recognition for Chinese electronic medical records.

Key words: Named Entity Recognition; Deep Learning; Electronic Medical Records
Received: 25 August 2021      Published: 14 April 2022
ZTFLH:  TP393  
Corresponding Author: Qin Qiuli, ORCID: 0000-0002-3787-8488, E-mail: qlqin@bjtu.edu.cn

Cite this article:

Zhang Fangcong, Qin Qiuli, Jiang Yong, Zhuang Runtao. Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 251-262.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0910     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/251

Structure Diagram of RoBERTa-WWM-BiLSTM-CRF Model
Structure Diagram of RoBERTa-WWM Model
Schematic Diagram of BERT Model
Schematic Diagram of RoBERTa Model
LSTM Cell Structure
Experimental Process of Chinese EMR Entity Recognition
Hyperparameter                 Value
Dropout                        0.5
Epoch                          30
Batch_size                     64
LSTM hidden layer dimension    768
Maximum sequence length        512
Learning rate                  0.0001
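The hyperparameter values listed above could be gathered into a single configuration object, as in the minimal sketch below. The class name `TrainingConfig` and its field names are hypothetical and chosen only for illustration; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters copied from the table; field names are illustrative."""

    dropout: float = 0.5          # Dropout
    epochs: int = 30              # Epoch
    batch_size: int = 64          # Batch_size
    lstm_hidden_dim: int = 768    # LSTM hidden layer dimension
    max_seq_length: int = 512     # Maximum sequence length
    learning_rate: float = 0.0001 # Learning rate


config = TrainingConfig()
print(config)
```

Freezing the dataclass keeps the settings from being changed by accident during a run.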
Annotation Method
Experiment                             Model                    Metric  Disease  Symptom  Body    Examination  Treatment  Overall
Different pre-trained models           BiLSTM-CRF               P       0.8101   0.8337   0.8217  0.8597       0.7852     0.8456
                                                                R       0.7850   0.8254   0.8104  0.8513       0.7612     0.8171
                                                                F1      0.7974   0.8295   0.8160  0.8554       0.7730     0.8311
                                       BERT-BiLSTM-CRF          P       0.8217   0.9011   0.8354  0.8603       0.7956     0.8541
                                                                R       0.8026   0.9102   0.8203  0.8624       0.8037     0.8268
                                                                F1      0.8120   0.9056   0.8278  0.8613       0.7996     0.8402
Proposed model (this paper)            RoBERTa-WWM-BiLSTM-CRF   P       0.8354   0.9427   0.8789  0.9118       0.8250     0.8908
                                                                R       0.8992   0.9203   0.8445  0.8572       0.7981     0.8907
                                                                F1      0.8661   0.9313   0.8614  0.8837       0.8113     0.8908
Different downstream model structures  RoBERTa-WWM-CRF          P       0.8287   0.9184   0.8413  0.8904       0.8026     0.8493
                                                                R       0.8036   0.9054   0.8261  0.8302       0.7629     0.8421
                                                                F1      0.8159   0.9118   0.8336  0.8592       0.7822     0.8457
                                       RoBERTa-WWM-LSTM-CRF     P       0.8299   0.9207   0.8507  0.9015       0.8126     0.8521
                                                                R       0.8307   0.9135   0.8324  0.8429       0.7853     0.8695
                                                                F1      0.8303   0.9171   0.8415  0.8712       0.7987     0.8607
Experimental Result
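The results table reports precision (P), recall (R) and F1 for each model. As a reading aid, the sketch below recomputes F1 with the standard harmonic-mean formula, F1 = 2PR / (P + R), from the P and R values in the last (overall) column of that table. The function name `f1_score` and the dictionary layout are illustrative assumptions, and the page does not say how the authors aggregated or rounded their scores, so small differences from the reported figures are to be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the overall column of the table.
reported_overall = {
    "BiLSTM-CRF": (0.8456, 0.8171, 0.8311),
    "BERT-BiLSTM-CRF": (0.8541, 0.8268, 0.8402),
    "RoBERTa-WWM-BiLSTM-CRF": (0.8908, 0.8907, 0.8908),
}

for model, (p, r, reported) in reported_overall.items():
    recomputed = f1_score(p, r)
    print(f"{model}: recomputed F1 = {recomputed:.4f}, reported F1 = {reported:.4f}")
```

The same function can be applied to any other P/R pair in the table to check a per-category score.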
Results of Different Annotation Methods
Metric  Urology  Neurology  General Surgery  Cardiovascular Medicine  Orthopedics  Respiratory  Oncology
P       0.8141   0.8517     0.8529           0.8989                   0.8542       0.8116       0.8956
R       0.8119   0.8473     0.8528           0.8973                   0.8537       0.8105       0.8861
F1      0.8135   0.8495     0.8528           0.8980                   0.8545       0.8111       0.8908
Entity Recognition Results by Department