Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 233-241    DOI: 10.11925/infotech.2096-3467.2021.0954
Automatic Classification with Unbalanced Data for Electronic Medical Records
Zhang Yunqiu1, Li Bocheng1, Chen Yan2
1College of Public Health, Jilin University, Changchun 130021, China
2Shenzhen Health Development Research and Data Management Center, Shenzhen 518028, China
[Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to improve classification performance on clinical electronic medical records. [Methods] First, we used MC-BERT to enhance the semantic representation of electronic medical records. Then, we designed a deep neural network framework to improve the model’s semantic extraction capability. Finally, we designed a new loss function that accounts for both class imbalance and classification difficulty, incorporating category proportions, a gradient harmonizing mechanism, and category similarity. [Results] We evaluated the new model on real electronic medical records. Its accuracy reached 81.37%, with a macro-average F1 of 65.89% and a micro-average F1 of 81.47%, outperforming the compared methods. [Limitations] We only retrieved medical records from one department. [Conclusions] The proposed method can effectively improve classification results on unbalanced data.

Key words: Unbalanced Data; Deep Learning; Electronic Medical Records; Cost-Sensitive Learning
Received: 31 August 2021      Published: 14 April 2022
ZTFLH:  TP391  
Fund: Humanities and Social Science Foundation of Ministry of Education (18YJA870017); Entrusted Project of Shenzhen Medical Information Center (2020(261)); Graduate Innovation Fund of Jilin University (101832020CX279)
Corresponding Author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581     E-mail:

Zhang Yunqiu, Li Bocheng, Chen Yan. Automatic Classification with Unbalanced Data for Electronic Medical Records. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 233-241.

Framework of Classification Model for Unbalanced Data
Schematic Diagram of Disease Knowledge Network
Disease Distribution in Data Set
Experimental Environment    Configuration
GPU                         GTX 1050 Ti (1 card)
CPU                         E5-2678 v3
Development environment     Python 3.7.3, TensorFlow 1.15.2
Epochs                      20
LSTM learning rate          0.001
Dropout                     0.5
Experimental Environment Settings
Confusion matrix            Predicted
                            Positive    Negative
Actual      Positive        TP          FN
            Negative        FP          TN
Confusion Matrix
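The TP, FP, and FN counts in the confusion matrix above are the basis for the accuracy, macro-average F1, and micro-average F1 reported in the comparison tables. The following is a minimal sketch of how these three metrics can be computed for single-label multi-class predictions. The function name `f1_scores` and the toy labels are hypothetical and are not taken from the paper. Note that under this single-label definition micro-average F1 coincides with accuracy, while the tables report slightly different values for the two, so the authors' exact averaging convention may differ from this sketch.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (accuracy, macro_f1, micro_f1) for single-label predictions.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # class t was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so rare classes weigh equally
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, macro, micro


# Toy example with an unbalanced two-class label set
acc, macro_f1, micro_f1 = f1_scores(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(acc, macro_f1, micro_f1)
```

Because macro-averaging gives each class the same weight regardless of its frequency, it is the more sensitive of the two F1 variants to performance on rare disease categories.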
Method               ACC       Macro F1    Micro F1
BERT-base-Chinese    78.86%    63.48%      76.33%
ERNIE                80.23%    64.67%      79.67%
Proposed method      81.37%    65.89%      81.47%
Comparison Results of Pre-Training Models
Method               ACC       Macro F1    Micro F1
Text-CNN             79.21%    64.33%      80.03%
BiLSTM               71.29%    58.21%      70.56%
BiGRU                72.43%    60.07%      72.23%
Text-CNN-BiGRU       80.44%    64.32%      80.21%
Proposed method      81.37%    65.89%      81.47%
Comparison Results of Neural Network Models
Method                    ACC       Macro F1    Micro F1
Weighted cross-entropy    74.83%    60.22%      74.97%
Focal Loss                77.35%    63.88%      77.53%
GHM                       80.57%    65.79%      79.92%
Proposed method           81.37%    65.89%      81.47%
Comparison Results of Loss Function
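To make the baseline loss functions in this comparison concrete, the sketch below shows the standard per-sample forms of weighted cross-entropy and Focal Loss in plain Python. It is an illustration under assumptions: the function names, the inverse-frequency weighting scheme, and the defaults `gamma=2.0` and `alpha=1.0` are hypothetical choices, not settings reported in the paper. GHM and the proposed loss are not reproduced here because their exact formulation is not given in this section.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more.

    One common weighting scheme, assumed here for illustration.
    """
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * c) for c in class_counts]


def weighted_cross_entropy(probs, target, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    p = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -class_weights[target] * math.log(p)


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal Loss for one sample: the (1 - p)^gamma factor down-weights easy samples."""
    p = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


# Toy example: 90 samples of class 0 and 10 of class 1
weights = inverse_frequency_weights([90, 10])
easy = [0.9, 0.1]   # confident, correct prediction for class 0
hard = [0.6, 0.4]   # class 1 is the true label but gets low probability
print(weighted_cross_entropy(hard, 1, weights))
print(focal_loss(easy, 0), focal_loss(hard, 1))
```

Weighted cross-entropy addresses only the class-frequency imbalance, while Focal Loss addresses sample difficulty; the abstract describes the proposed loss as combining both perspectives.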
