Feature Selection and Efficient Disease Early Warning Based on Optimized Ensemble Learning Model: Case Study of Geriatric Depression and Anxiety
Yan Ying 1, Huang Qi 1,2, Li Na 1
1 School of Information Management, Nanjing University, Nanjing 210023, China
2 Nanjing Research Base of National Information Management, Nanjing University, Nanjing 210093, China
[Objective] This paper selects key disease risk variables so that disease prediction models balance computational efficiency against prediction accuracy, helping public health departments achieve efficient early warning of disease.
[Methods] We used two ensemble learning models, Random Forest and XGBoost, to learn from high-dimensional disease risk variable data and predict disease. Both models autonomously select the subsets of variables that contribute to their predictions. To ensure that the selected subsets retain high prediction accuracy, we analyzed the ensemble strategies of Random Forest and XGBoost. By tuning hyperparameters with cross-validation, we iteratively reduced the out-of-bag error rate of the Random Forest model and brought the loss curves of the XGBoost model to convergence on different training subsets. Finally, we proposed a separate optimization scheme for each model to improve its disease prediction performance.
[Results] We evaluated the optimized models on a dataset of geriatric depression and anxiety. Both showed excellent and comparable prediction performance, with accuracies of 88.6% and 89.7% and AUC values of 0.936 and 0.940, respectively. However, with optimized feature selection the XGBoost model had a simpler and more efficient structure: it selected only 17 key variables out of the 54 geriatric depression and anxiety risk variables while still achieving a prediction accuracy of 85.8% and an AUC of 0.917.
[Limitations] We did not use the latest geriatric cohort data in our experiments. Further research is needed to test how well the models adapt to complex and heterogeneous data environments.
[Conclusions] The optimized XGBoost model offers the better feature selection, which improves the efficiency of disease early warning and provides decision support for public health management.
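The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The data are synthetic (54 candidate variables generated with scikit-learn's `make_classification`), the tree counts are arbitrary, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost to keep the dependencies small. Only the figures 54 and 17 come from the abstract. The sketch tracks the Random Forest out-of-bag error for several forest sizes, then compares cross-validated AUC for a boosted model trained on all variables against one restricted to the 17 most important variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 54 candidate risk variables, binary outcome.
X, y = make_classification(
    n_samples=600, n_features=54, n_informative=10, n_redundant=5, random_state=0
)


def oob_error_curve(X, y, tree_counts):
    """Return the Random Forest out-of-bag error for each candidate tree count."""
    errors = {}
    for n in tree_counts:
        rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=0)
        rf.fit(X, y)
        errors[n] = 1.0 - rf.oob_score_
    return errors


def make_booster():
    """Gradient boosting model used here as a stand-in for XGBoost (assumption)."""
    return GradientBoostingClassifier(n_estimators=100, random_state=0)


oob_errors = oob_error_curve(X, y, tree_counts=[50, 100, 200])

# Keep exactly the 17 most important variables. Selection happens inside the
# pipeline, so each cross-validation fold selects on its own training split.
reduced_model = Pipeline([
    ("select", SelectFromModel(make_booster(), max_features=17, threshold=-np.inf)),
    ("clf", make_booster()),
])

full_auc = cross_val_score(make_booster(), X, y, cv=5, scoring="roc_auc").mean()
reduced_auc = cross_val_score(reduced_model, X, y, cv=5, scoring="roc_auc").mean()

# Fit once on all data to inspect which variables were kept.
reduced_model.fit(X, y)
selected = np.flatnonzero(reduced_model.named_steps["select"].get_support())

print("OOB error by number of trees:", oob_errors)
print("Selected variable indices:", selected)
print(f"5-fold AUC, all 54 variables: {full_auc:.3f}")
print(f"5-fold AUC, 17 selected variables: {reduced_auc:.3f}")
```

Placing the selector inside the pipeline is a design choice of this sketch: it prevents the variable ranking from seeing the held-out fold, so the reduced-model AUC is not inflated by selection bias.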
Yan Ying, Huang Qi, Li Na. Feature Selection and Efficient Disease Early Warning Based on Optimized Ensemble Learning Model: Case Study of Geriatric Depression and Anxiety[J]. Data Analysis and Knowledge Discovery, 2023, 7(7): 74-88.