[Objective] This paper addresses the classification problems posed by imbalanced sample data, aiming to find a better solution and improve the prediction of diabetic complications. [Methods] At the data level, we used an improved SMOTE oversampling algorithm (F_SMOTE) to change the class distribution of the imbalanced data. At the algorithm level, we adopted balanced accuracy and the areas under the ROC and PR curves (AUC) as evaluation criteria. Finally, we compared the performance of four single-classifier learning models and four ensemble learning models. [Results] Compared with the traditional oversampling algorithm, our F_SMOTE algorithm improved prediction accuracy, ROC AUC and PR AUC by 1.49%, 3.43% and 8.05%, respectively. Compared with the single-classifier learning model, our method improved accuracy, ROC AUC and PR AUC by 9.73%, 14.07% and 46.79%, respectively. The combination of the F_SMOTE algorithm and the Random Forest model reached 97.64% accuracy, 98.91% ROC AUC and 96.64% PR AUC on the imbalanced data. [Limitations] The coverage and efficiency of our model training need to be further improved. [Conclusions] This method provides researchers with a predictive analysis framework and could also help doctors diagnose and prevent disease.
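As an illustration of the data-level and evaluation steps named above, the following is a minimal sketch of standard SMOTE interpolation and of balanced accuracy. It is not the F_SMOTE algorithm, whose modifications are not described in this abstract; the function names `smote` and `balanced_accuracy`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE sketch (not F_SMOTE): create n_new synthetic points by
    interpolating between a minority sample and one of its k nearest
    minority neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so the majority class cannot dominate."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical toy data: four minority samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)

# A classifier that always predicts the majority class (0) gets 4 of 6
# labels right, yet its balanced accuracy is only 0.5.
score = balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0])
```

The second example shows why balanced accuracy is preferred over plain accuracy on imbalanced data: a model that ignores the minority class entirely scores no better than chance.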