Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 242-250    DOI: 10.11925/infotech.2096-3467.2021.0951
Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model
Zhang Yunqiu, Wang Yang, Li Bocheng
School of Public Health, Jilin University, Changchun 130021, China
[Objective] This paper proposes a named entity recognition model based on RoBERTa-wwm dynamic fusion to improve entity recognition in Chinese electronic medical records. [Methods] First, we fused the semantic representations generated by each Transformer layer of the pre-trained language model RoBERTa-wwm. Then, we fed the fused representation into a bi-directional long short-term memory (BiLSTM) network and a conditional random field (CRF) module to recognize the entities in the electronic medical records. [Results] We evaluated the model on the dataset of the “2017 National Knowledge Graph and Semantic Computing Conference (CCKS 2017)” and on self-annotated electronic medical records. The F1 values reached 94.08% and 90.08%, respectively, which were 0.23 and 0.39 percentage points higher than those of the RoBERTa-wwm-BiLSTM-CRF model. [Limitations] The RoBERTa-wwm model used in this paper was pre-trained on a non-medical corpus. [Conclusions] The proposed method improved entity recognition results on both datasets.
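
The layer fusion step described in [Methods] can be illustrated with a minimal sketch. It assumes that "dynamic fusion" means a weighted sum of the per-layer representations with one learnable scalar score per layer, normalized by softmax; the abstract does not give the exact formula, so the function names `softmax` and `fuse_layers` and the toy vectors below are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-layer scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs, layer_scores):
    """Weighted sum of one token's hidden vectors across Transformer layers.

    layer_outputs: list of L vectors, one per layer (all of equal length).
    layer_scores:  list of L scalars; in a real model these would be
                   trainable parameters (assumption, not from the paper).
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    fused = [0.0] * len(layer_outputs[0])
    for w, vec in zip(weights, layer_outputs):
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused


# Toy example: 3 "layers", hidden size 2, equal scores -> plain average.
toy_layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [0.0, 0.0, 0.0]
toy_fused = fuse_layers(toy_layers, toy_scores)
```

In a full model, the fused vector for each token would then be passed to the BiLSTM and CRF layers; raising one layer's score shifts the fused representation toward that layer.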

Key words: Electronic Medical Record; Named Entity Recognition; RoBERTa-wwm; Dynamic Fusion
Received: 31 August 2021      Published: 07 January 2022
CLC Number: TP391
Fund: Humanities and Social Science Foundation of Ministry of Education (18YJA870017); Jilin Province Social Science Foundation (2019B59); Graduate Innovation Foundation of Jilin University (101832020CX279)
Zhang Yunqiu, Wang Yang, Li Bocheng. Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 242-250.

RoBERTa-wwm Dynamic Fusion Model
Example of Whole Word Masking in RoBERTa-wwm
Dynamic Weight Fusion
LSTM Unit Structure
Dataset        Disease/Diagnosis   Symptoms/Signs   Examinations/Tests   Body Parts   Treatment
300 records    1,209               7,538            9,995                9,844        1,470
Share          4.02%               25.08%           33.25%               32.75%       4.89%
CCKS2017 Shared Task2 Entity Distribution
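
As a quick consistency check, the percentage row can be recomputed from the entity counts in the table. This is only an arithmetic sketch; the dictionary keys are hypothetical English labels for the five entity types.

```python
# Entity counts from the CCKS2017 Shared Task2 distribution table.
# Key names are illustrative English labels, not identifiers from the paper.
entity_counts = {
    "disease_diagnosis": 1209,
    "symptom_sign": 7538,
    "examination_test": 9995,
    "body_part": 9844,
    "treatment": 1470,
}

total_entities = sum(entity_counts.values())

# Share of each entity type, in percent, rounded to two decimals.
entity_shares = {
    name: round(100.0 * count / total_entities, 2)
    for name, count in entity_counts.items()
}

for name, share in entity_shares.items():
    print(f"{name}: {share:.2f}%")
```

The recomputed shares match the percentages in the table, which shows that examinations/tests and body parts together account for about two thirds of all annotated entities, while disease/diagnosis and treatment are comparatively rare.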
Parameter                Value
Dropout                  0.5
Hidden layer dimension   768
Optimizer                Adam
Learning rate            0.0001
Batch_size               32
Decay rate               0.8
LSTM_dim                 256
Epoch                    24
Max_seq_len              150
Step_size                2,000
Parameter Settings
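
The table lists a learning rate, a decay rate, and a step size but does not say how they are combined. One common reading, assumed here and not confirmed by the source, is a staircase schedule in which the learning rate is multiplied by the decay rate once every `step_size` training steps. The function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(step, base_lr=0.0001, decay_rate=0.8, step_size=2000):
    """Staircase decay: multiply the rate by decay_rate every step_size steps.

    Default values are taken from the parameter table; the schedule
    itself is an assumption about how those values are used.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    return base_lr * decay_rate ** (step // step_size)


# Learning rate at a few training steps under this assumed schedule.
for s in (0, 1999, 2000, 4500):
    print(s, step_decay_lr(s))
```

Under this reading, the rate stays at 0.0001 for the first 2,000 steps, drops to 0.00008 for the next 2,000, and so on.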
Model                     P/%     R/%     F1/%
BiLSTM-CRF                86.63   87.84   87.23
BERT-BiLSTM-CRF           93.10   94.27   93.68
RoBERTa-wwm-BiLSTM-CRF    93.25   94.45   93.85
Proposed model            93.42   94.73   94.08
Evaluation Results of Each Model on the CCKS2017
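
The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded P and R values in the CCKS2017 table; the results agree with the reported F1 values to within about 0.01, as expected from rounding. The dictionary layout and the name `f1_score` are illustrative choices.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from the CCKS2017 results table.
ccks_results = {
    "BiLSTM-CRF": (86.63, 87.84, 87.23),
    "BERT-BiLSTM-CRF": (93.10, 94.27, 93.68),
    "RoBERTa-wwm-BiLSTM-CRF": (93.25, 94.45, 93.85),
    "Proposed model": (93.42, 94.73, 94.08),
}

for model, (p, r, reported) in ccks_results.items():
    print(f"{model}: recomputed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")

# Gain of the proposed model over the strongest baseline, in percentage points.
f1_gain = ccks_results["Proposed model"][2] - ccks_results["RoBERTa-wwm-BiLSTM-CRF"][2]
```

The gain works out to 0.23 percentage points, matching the figure quoted in the abstract.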
Model                     P/%     R/%     F1/%
BiLSTM-CRF                83.07   83.79   83.43
BERT-BiLSTM-CRF           88.93   89.57   89.24
RoBERTa-wwm-BiLSTM-CRF    89.62   89.75   89.69
Proposed model            89.77   90.39   90.08
Evaluation Results of Each Model on the Self-Labeled Data Set
