[Objective] This paper addresses the classification problems posed by unbalanced sample data, aiming to find a better solution and improve the prediction of diabetic complications. [Methods] At the data level, we used an improved SMOTE oversampling algorithm (F_SMOTE) to change the class distribution of the unbalanced data. At the algorithm level, we adopted balanced accuracy, area under the ROC curve (ROC-AUC), and area under the PR curve (PR-AUC) as evaluation criteria. Finally, we compared the performance of four single-classifier learning models and four ensemble learning models. [Results] Compared with the traditional oversampling algorithm, our F_SMOTE algorithm improved accuracy, ROC-AUC, and PR-AUC by 1.49%, 3.43%, and 8.05%, respectively. Compared with the single-classifier learning model, our method improved accuracy, ROC-AUC, and PR-AUC by 9.73%, 14.07%, and 46.79%, respectively. The combination of the F_SMOTE algorithm and the Random Forest model reached 97.64% accuracy, 98.91% ROC-AUC, and 96.64% PR-AUC on unbalanced data. [Limitations] The coverage and efficiency of our model training need further improvement. [Conclusions] This method provides a predictive analysis framework for researchers and could also help doctors with disease diagnosis and prevention.
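To make the data-level and evaluation steps concrete, the following is a minimal sketch of classic SMOTE interpolation and of balanced accuracy. It is not the F_SMOTE algorithm, whose modifications are not described in this abstract. The function names `smote` and `balanced_accuracy`, the parameter `k`, and the toy data are assumptions introduced for illustration only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Classic SMOTE sketch (not F_SMOTE): create n_new synthetic points by
    interpolating between a minority sample and one of its k nearest
    minority neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class carries equal weight."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy data, made up for illustration: 4 minority vs. 20 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(5.0 + 0.1 * i, 5.0 - 0.1 * i) for i in range(20)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + new_points

# Plain accuracy here would be 5/6; balanced accuracy penalises the
# missed minority case more heavily.
score = balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
print(score)  # 0.75
```

Balanced accuracy averages the recall of each class, which is why it is preferred over plain accuracy when one class dominates the sample.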