Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 233-241    DOI: 10.11925/infotech.2096-3467.2021.0954
Automatic Classification with Unbalanced Data for Electronic Medical Records
Zhang Yunqiu1, Li Bocheng1, Chen Yan2
1College of Public Health, Jilin University, Changchun 130021, China
2Shenzhen Health Development Research and Data Management Center, Shenzhen 518028, China

[Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to improve classification performance on clinical records. [Methods] First, we used MC-BERT to enhance the semantic representation of the records. Then, we designed a deep neural network framework to improve the model's semantic extraction capability. Finally, we designed a new loss function that accounts for both class imbalance and classification difficulty by incorporating category proportions, a gradient harmonizing mechanism, and category similarity. [Results] On real electronic medical records, the model reached an accuracy of 81.37%, a macro-average F1 of 65.89%, and a micro-average F1 of 81.47%, outperforming the compared baseline methods. [Limitations] The medical records came from only one department. [Conclusions] The proposed method can effectively improve classification results on unbalanced data.
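
The abstract does not state how category proportions enter the loss. As a rough illustration only, one common heuristic is inverse-frequency weighting, sketched below. The helper name `inverse_frequency_weights` and the toy labels are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = N / (K * count).

    Hypothetical helper: rare classes get weights above 1 and frequent
    classes get weights below 1. This is a generic heuristic, not the
    formula used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Toy, made-up label distribution (not the paper's data set).
toy_labels = ["A"] * 8 + ["B"] * 2
weights = inverse_frequency_weights(toy_labels)
```

With these toy labels the rare class "B" receives a larger weight than the frequent class "A", so its errors would count more in a weighted loss.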

Key words: Unbalanced Data; Deep Learning; Electronic Medical Records; Cost-Sensitive Learning
Received: 31 August 2021      Published: 14 April 2022
ZTFLH:  TP391  
Fund: Humanities and Social Science Foundation of Ministry of Education (18YJA870017); Entrusted Project of Shenzhen Medical Information Center (2020(261)); Graduate Innovation Fund of Jilin University (101832020CX279)
Corresponding Author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581

Cite this article:

Zhang Yunqiu, Li Bocheng, Chen Yan. Automatic Classification with Unbalanced Data for Electronic Medical Records. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 233-241.


Framework of Classification Model for Unbalanced Data
Schematic Diagram of Disease Knowledge Network
Disease Distribution in Data Set
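
Experimental Environment Settings

| Item | Configuration |
| --- | --- |
| GPU | GTX 1050TI (1 card) |
| CPU | E5-2678V3 |
| Development environment | Python 3.7.3, TensorFlow 1.15.2 |
| Epoch | 20 |
| LSTM learning rate | 0.001 |
| Dropout | 0.5 |
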

Confusion Matrix

| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |

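
The confusion-matrix counts feed the F1 scores reported in the comparison tables. The sketch below shows the standard definitions of precision, recall, F1, and their macro and micro averages. The function names and the toy counts are assumptions for illustration; the paper's own evaluation code is not shown on this page.

```python
def f1_from_counts(tp, fp, fn):
    """Precision, recall and F1 for one class; 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def macro_micro_f1(per_class):
    """per_class: list of (tp, fp, fn) tuples, one tuple per class."""
    # Macro: average the per-class F1 scores, each class counting equally.
    macro = sum(f1_from_counts(*c)[2] for c in per_class) / len(per_class)
    # Micro: pool the counts over all classes, then compute a single F1.
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    micro = f1_from_counts(tp, fp, fn)[2]
    return macro, micro

# Made-up counts: one frequent, well-classified class and one rare class.
toy_counts = [(90, 5, 10), (2, 10, 5)]
macro_f1, micro_f1 = macro_micro_f1(toy_counts)
```

Because the macro average weights every class equally, a poorly classified rare class pulls it well below the micro average on imbalanced data, as happens with these toy counts.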

Comparison Results of Pre-Training Models

| Method | ACC | Macro F1 | Micro F1 |
| --- | --- | --- | --- |
| BERT-base-Chinese | 78.86% | 63.48% | 76.33% |
| ERNIE | 80.23% | 64.67% | 79.67% |
| Proposed method | 81.37% | 65.89% | 81.47% |


Comparison Results of Neural Network Models

| Method | ACC | Macro F1 | Micro F1 |
| --- | --- | --- | --- |
| Text-CNN | 79.21% | 64.33% | 80.03% |
| BiLSTM | 71.29% | 58.21% | 70.56% |
| BiGRU | 72.43% | 60.07% | 72.23% |
| Text-CNN-BiGRU | 80.44% | 64.32% | 80.21% |
| Proposed method | 81.37% | 65.89% | 81.47% |

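
Comparison Results of Loss Function

| Method | ACC | Macro F1 | Micro F1 |
| --- | --- | --- | --- |
| Weighted cross-entropy | 74.83% | 60.22% | 74.97% |
| Focal Loss | 77.35% | 63.88% | 77.53% |
| GHM | 80.57% | 65.79% | 79.92% |
| Proposed method | 81.37% | 65.89% | 81.47% |
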
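
For readers unfamiliar with the first two baselines in the loss comparison, the following is a minimal single-sample sketch of weighted cross-entropy and focal loss in plain Python. The function names, the default `gamma=2.0`, and the toy logits are assumptions for illustration. It covers neither GHM nor the paper's own loss, whose formula is not given on this page.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy for one sample, scaled by the target class weight."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * math.log(p_t)

def focal_loss(logits, target, gamma=2.0, class_weights=None):
    """Focal loss for one sample: (1 - p_t)^gamma down-weights easy cases."""
    p_t = softmax(logits)[target]
    w = 1.0 if class_weights is None else class_weights[target]
    return -w * (1.0 - p_t) ** gamma * math.log(p_t)

# Toy logits for a two-class problem; the true class is 0 in both cases.
easy = [4.0, 0.0]  # model is confident and correct
hard = [0.0, 1.0]  # model leans toward the wrong class
unit = [1.0, 1.0]  # unit class weights, i.e. plain cross-entropy

ce_easy = weighted_cross_entropy(easy, 0, unit)
ce_hard = weighted_cross_entropy(hard, 0, unit)
fl_easy = focal_loss(easy, 0)
fl_hard = focal_loss(hard, 0)
```

With `gamma=0` and unit weights the focal loss reduces to plain cross-entropy. With `gamma=2.0` the easy sample's loss shrinks far more than the hard sample's, which is how focal loss shifts training effort toward hard samples.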
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938