Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 251-262    DOI: 10.11925/infotech.2096-3467.2021.0910
Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF
Zhang Fangcong1, Qin Qiuli1, Jiang Yong2, Zhuang Runtao3
1School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China
2National Clinical Medical Research Center for Nervous System Diseases, Beijing Tiantan Hospital Affiliated to Capital Medical University, Beijing 100050, China
3Community Health Service Center, Beijing Jiaotong University, Beijing 100044, China
[Objective] This study addresses the problems of polysemy and incomplete words in entity recognition for Chinese Electronic Medical Records (EMR). [Methods] We constructed a deep learning model, RoBERTa-WWM-BiLSTM-CRF, to improve named entity recognition for Chinese EMR, and conducted four rounds of experiments to compare how different settings affect entity recognition. [Results] The highest F1 value of the new model reached 0.8908. [Limitations] The experimental data set is small, and the entity recognition results for some departments were weaker. For example, the F1 value for the respiratory department was only 0.8111. [Conclusions] The RoBERTa-WWM-BiLSTM-CRF model can effectively perform named entity recognition on Chinese electronic medical records.

Key words: Named Entity Recognition; Deep Learning; Electronic Medical Records
Received: 25 August 2021      Published: 14 April 2022
ZTFLH:  TP393  
Corresponding Author: Qin Qiuli, ORCID: 0000-0002-3787-8488     E-mail: qlqin

Zhang Fangcong, Qin Qiuli, Jiang Yong, Zhuang Runtao. Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 251-262.

Structure Diagram of RoBERTa-WWM-BiLSTM-CRF Model
Structure Diagram of RoBERTa-WWM Model
Schematic Diagram of BERT Model
Schematic Diagram of RoBERTa Model
LSTM Cell Structure
Experimental Process of Chinese EMR Entity Recognition
Hyperparameter            Value
Dropout                   0.5
Epoch                     30
Batch_size                64
LSTM hidden layer size    768
Maximum sequence length   512
Learning rate             0.0001
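As a minimal sketch, the hyperparameters listed above could be collected in a single configuration object so that every experiment run uses the same settings. The class name and field names below are illustrative assumptions and do not come from the paper; only the values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameter values from the table; field names are hypothetical."""
    dropout: float = 0.5          # Dropout
    epochs: int = 30              # Epoch
    batch_size: int = 64          # Batch_size
    lstm_hidden_dim: int = 768    # LSTM hidden layer size
    max_seq_length: int = 512     # Maximum sequence length
    learning_rate: float = 1e-4   # Learning rate (0.0001)


config = TrainingConfig()
print(config)
```

Freezing the dataclass is a design choice made here so that the settings cannot be changed by accident partway through a run.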
Annotation Method
Experiment / Model        Metric  Disease  Symptom  Body    Examination  Treatment  Overall
Different pre-trained models
  BiLSTM-CRF              P       0.8101   0.8337   0.8217  0.8597       0.7852     0.8456
                          R       0.7850   0.8254   0.8104  0.8513       0.7612     0.8171
                          F1      0.7974   0.8295   0.8160  0.8554       0.7730     0.8311
  BERT-BiLSTM-CRF         P       0.8217   0.9011   0.8354  0.8603       0.7956     0.8541
                          R       0.8026   0.9102   0.8203  0.8624       0.8037     0.8268
                          F1      0.8120   0.9056   0.8278  0.8613       0.7996     0.8402
Model proposed in this paper
  RoBERTa-WWM-BiLSTM-CRF  P       0.8354   0.9427   0.8789  0.9118       0.8250     0.8908
                          R       0.8992   0.9203   0.8445  0.8572       0.7981     0.8907
                          F1      0.8661   0.9313   0.8614  0.8837       0.8113     0.8908
Different downstream model structures
  RoBERTa-WWM-CRF         P       0.8287   0.9184   0.8413  0.8904       0.8026     0.8493
                          R       0.8036   0.9054   0.8261  0.8302       0.7629     0.8421
                          F1      0.8159   0.9118   0.8336  0.8592       0.7822     0.8457
  RoBERTa-WWM-LSTM-CRF    P       0.8299   0.9207   0.8507  0.9015       0.8126     0.8521
                          R       0.8307   0.9135   0.8324  0.8429       0.7853     0.8695
                          F1      0.8303   0.9171   0.8415  0.8712       0.7987     0.8607
Experimental Result
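The P, R and F1 rows in the results table are related by the standard definition of F1 as the harmonic mean of precision and recall. The sketch below recomputes the overall F1 of the BiLSTM-CRF baseline from its reported P and R. The function name is an illustrative assumption, and because the table values are already rounded to four decimals, a recomputed F1 may differ from a reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall P and R reported for the BiLSTM-CRF baseline
baseline_f1 = f1_score(0.8456, 0.8171)
print(round(baseline_f1, 4))  # 0.8311, matching the reported overall F1
```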
Results of Different Annotation Methods
Metric  Urology  Neurology  General Surgery  Cardiovascular Medicine  Orthopedics  Respiratory  Oncology
P       0.8141   0.8517     0.8529           0.8989                   0.8542       0.8116       0.8956
R       0.8119   0.8473     0.8528           0.8973                   0.8537       0.8105       0.8861
F1      0.8135   0.8495     0.8528           0.8980                   0.8545       0.8111       0.8908
Entity Recognition Results by Department
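To relate the per-department results to the limitation stated in the abstract, the short sketch below looks up the departments with the lowest and highest F1. The F1 values are copied from the department table; the English department names are translations chosen for this example, and the variable names are illustrative.

```python
# F1 by department, copied from the department results table
department_f1 = {
    "Urology": 0.8135,
    "Neurology": 0.8495,
    "General Surgery": 0.8528,
    "Cardiovascular Medicine": 0.8980,
    "Orthopedics": 0.8545,
    "Respiratory": 0.8111,
    "Oncology": 0.8908,
}

# Departments with the weakest and strongest entity recognition results
lowest = min(department_f1, key=department_f1.get)
highest = max(department_f1, key=department_f1.get)

print(lowest, department_f1[lowest])    # Respiratory 0.8111
print(highest, department_f1[highest])  # Cardiovascular Medicine 0.898
```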
