Automatic Classification with Unbalanced Data for Electronic Medical Records
Zhang Yunqiu1, Li Bocheng1, Chen Yan2
1 College of Public Health, Jilin University, Changchun 130021, China
2 Shenzhen Health Development Research and Data Management Center, Shenzhen 518028, China
Abstract [Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to improve classification performance on clinical electronic medical records. [Methods] First, we used MC-BERT to enhance the semantic representation of the records. Then, we designed a deep neural network framework to improve the model's semantic extraction capability. Finally, we designed a new loss function that accounts for both class imbalance and classification difficulty by incorporating category proportions, a gradient coordination mechanism, and category similarity. [Results] We evaluated the model on real electronic medical records. It achieved an accuracy of 81.37%, a macro-average F1 of 65.89%, and a micro-average F1 of 81.47%, outperforming existing methods. [Limitations] The medical records were retrieved from only one department. [Conclusions] The proposed method can effectively improve classification results on unbalanced data.
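The abstract names the ingredients of the loss function but gives no formula, so the following is only a minimal sketch of the general idea, not the authors' loss. It assumes inverse-frequency class weights (standing in for "category proportions") and a focal-style factor that down-weights easy samples (standing in for "difficulty of classification"). The gradient coordination mechanism and the category similarity term are not modeled here. The function names `class_weights` and `weighted_focal_loss` and the parameter `gamma` are hypothetical.

```python
import numpy as np


def class_weights(labels, num_classes):
    """Inverse-frequency class weights, scaled so their mean is 1.

    Assumption: this is one common weighting scheme, not necessarily
    the one used in the paper.
    """
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes with no samples
    inv = 1.0 / counts
    return inv * num_classes / inv.sum()


def weighted_focal_loss(logits, labels, weights, gamma=2.0):
    """Class-weighted, focal-style cross-entropy averaged over the batch.

    logits:  (batch, num_classes) raw scores
    labels:  (batch,) integer class ids
    weights: (num_classes,) per-class weights
    gamma:   focusing parameter; gamma = 0 gives weighted cross-entropy
    """
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Log-probability and probability assigned to the true class.
    rows = np.arange(len(labels))
    log_pt = log_probs[rows, labels]
    pt = np.exp(log_pt)

    # Rare classes get larger weights; confident (easy) samples are damped.
    per_sample = -weights[labels] * (1.0 - pt) ** gamma * log_pt
    return per_sample.mean()


if __name__ == "__main__":
    labels = np.array([0, 0, 0, 1])      # class 1 is the minority
    logits = np.zeros((4, 2))            # uninformative predictions
    w = class_weights(labels, num_classes=2)
    print(w, weighted_focal_loss(logits, labels, w))
```

With three samples of class 0 and one of class 1, the minority class receives three times the weight of the majority class, so an error on a rare diagnosis category contributes more to the loss than an error on a common one.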
Received: 31 August 2021
Published: 14 April 2022
Fund: Humanities and Social Science Foundation of Ministry of Education (18YJA870017); Entrusted Project of Shenzhen Medical Information Center (2020(261)); Graduate Innovation Fund of Jilin University (101832020CX279)
Corresponding Author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581
E-mail: yunqiu@jlu.edu.cn