Automatic Classification with Unbalanced Data for Electronic Medical Records
Zhang Yunqiu1, Li Bocheng1, Chen Yan2
1College of Public Health, Jilin University, Changchun 130021, China
2Shenzhen Health Development Research and Data Management Center, Shenzhen 518028, China
[Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to improve the classification performance on clinical electronic medical records. [Methods] First, we used MC-BERT to enhance the semantic representation of electronic medical records. Then, we designed a deep neural network framework to improve the model's semantic extraction capabilities. Finally, we designed a new loss function that addresses both the imbalance among sample categories and the difficulty of classifying individual samples; it incorporates the proportion of each category, a gradient coordination mechanism, and category similarity. [Results] We evaluated the new model on real electronic medical records. It reached an accuracy of 81.37%, a macro-average F1 value of 65.89%, and a micro-average F1 value of 81.47%, all better than the existing methods. [Limitations] The medical records were retrieved from only one department. [Conclusions] The proposed method can effectively improve the classification results on unbalanced data.
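The abstract does not give the formula of the loss function. As a rough illustration of how category proportion and classification difficulty can both enter a loss, the sketch below combines inverse-frequency class weights with the modulating factor of the well-known focal loss. It is not the authors' loss: the gradient coordination mechanism and the category-similarity term are left out, and the names `class_weights`, `weighted_focal_loss`, and `gamma` are hypothetical choices made for this example.

```python
import math


def class_weights(counts):
    """Inverse-frequency weights, rescaled so they average to 1 across classes."""
    total = sum(counts)
    raw = [total / c for c in counts]          # rarer class -> larger raw weight
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def weighted_focal_loss(logits, target, weights, gamma=2.0):
    """Class-weighted focal loss for one sample (illustrative assumption).

    weights[target] raises the cost of errors on rare classes;
    (1 - p_t) ** gamma shrinks the loss of samples that are already easy.
    """
    p_t = softmax(logits)[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical class counts for three diagnosis categories.
w = class_weights([900, 90, 10])
easy = weighted_focal_loss([4.0, 0.0, 0.0], target=0, weights=w)
hard = weighted_focal_loss([0.0, 0.0, 0.5], target=2, weights=w)
```

With `gamma=0` and all weights equal to 1, the function reduces to ordinary cross-entropy, which makes the two added factors easy to inspect separately.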
Zhang Yunqiu, Li Bocheng, Chen Yan. Automatic Classification with Unbalanced Data for Electronic Medical Records[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 233-241.