[Objective] This paper builds a model to predict 5-year survival in gastric cancer patients from the SEER database, both to support prognosis and to analyze the factors that affect 5-year survival. [Methods] Following the EasyEnsemble idea from ensemble learning, we addressed class imbalance at both the data level and the model level. We then combined multiple GradientBoosting classifiers with Bagging to build a prediction model on the imbalanced gastric cancer survival data. Finally, we used SHAP values to identify the factors affecting 5-year survival. [Results] The model reached a prediction accuracy of 0.808 and an AUC of 0.883; the prediction accuracy for the subclass of surviving patients was 0.835. It outperformed the traditional models used for comparison. Regional nodes positive, summary stage/grade, and age had the higher SHAP values. [Limitations] The SEER database provides only a limited set of prognostic factors, which constrained the model’s performance. [Conclusions] The model can effectively predict 5-year survival for gastric cancer and identify the factors that influence patients’ 5-year survival probability.
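The approach in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn and NumPy, uses synthetic data in place of SEER records, and the function names, subset count, and hyperparameters are hypothetical. Each ensemble member is a GradientBoosting classifier trained on all minority-class samples plus an equally sized random draw from the majority class (the data-level step), and the members' predicted probabilities are averaged (the model-level step). The SHAP analysis is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def fit_balanced_gb_ensemble(X, y, n_subsets=10, seed=0):
    """Train one GradientBoostingClassifier per balanced subset (binary labels).

    Each subset keeps every minority-class sample and an equally sized
    random draw (without replacement) from the majority class.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    models = []
    for i in range(n_subsets):
        sampled = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sampled])
        # Hypothetical hyperparameters, chosen only to keep the sketch fast.
        model = GradientBoostingClassifier(n_estimators=50, random_state=seed + i)
        models.append(model.fit(X[idx], y[idx]))
    return models


def predict_proba_positive(models, X):
    """Average the probability of label 1 over all ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic stand-in for imbalanced survival data (label 1 = minority class).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = fit_balanced_gb_ensemble(X_train, y_train, n_subsets=5)
proba = predict_proba_positive(models, X_test)
pred = (proba >= 0.5).astype(int)
auc = roc_auc_score(y_test, proba)
minority_recall = (pred[y_test == 1] == 1).mean()
print(f"AUC={auc:.3f}, minority-class recall={minority_recall:.3f}")
```

Undersampling into several balanced subsets lets every member see all minority-class samples while the ensemble as a whole still covers much of the majority class, which is the motivation usually given for EasyEnsemble-style methods.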
Xu Liangchen, Guo Chonghui. Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning[J]. Data Analysis and Knowledge Discovery, 2021, 5(8): 86-99.