Predicting Diabetic Complications with Unbalanced Data
Qiu Yunfei, Guo Lei
School of Software, Liaoning Technical University, Huludao 125105, China
Abstract [Objective] This paper addresses the classification of unbalanced sample data, aiming to improve the prediction of diabetic complications. [Methods] At the data level, we used an improved SMOTE oversampling algorithm (F_SMOTE) to change the class distribution of the unbalanced data. At the algorithm level, we adopted balanced accuracy and the areas under the ROC and PR curves as evaluation criteria. Finally, we compared the performance of four single-classifier learning models and four ensemble learning models. [Results] Compared with the traditional oversampling algorithm, our F_SMOTE algorithm improved the prediction accuracy, ROC and PR by 1.49%, 3.43% and 8.05%, respectively. Compared with the single-classifier learning model, our method improved the accuracy, ROC and PR by 9.73%, 14.07% and 46.79%, respectively. Combining the F_SMOTE algorithm with the Random Forest model reached 97.64% in accuracy, 98.91% in ROC and 96.64% in PR on unbalanced data. [Limitations] The coverage and efficiency of our model training need to be further improved. [Conclusions] This method provides a predictive analysis framework for researchers, which could also help doctors in disease diagnosis and prevention.
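The two ideas named in the abstract, oversampling the minority class and scoring with balanced accuracy, can be illustrated with a minimal sketch. It assumes the classic SMOTE interpolation step (Chawla et al.), not the authors' F_SMOTE variant, whose details the abstract does not give. The function names `smote` and `balanced_accuracy` and the toy data are hypothetical and chosen only for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Classic SMOTE sketch (an assumption, not the paper's F_SMOTE).

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical toy minority class with two features per sample.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)

# A classifier that favours the majority class: plain accuracy is 5/6,
# while balanced accuracy averages the recalls 1.0 and 0.5.
score = balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 1])
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the range already covered by the minority class. Balanced accuracy is used rather than plain accuracy because the latter can look high on unbalanced data even when most minority cases are missed.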
Received: 08 June 2020
Published: 11 November 2020
Corresponding Authors:
Guo Lei ORCID:0000-0003-3441-5063
E-mail: 752714018@qq.com