Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 86-99    DOI: 10.11925/infotech.2096-3467.2021.0045
Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning
Xu Liangchen, Guo Chonghui
Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China

[Objective] This paper builds a model on the SEER database to predict 5-year survival for gastric cancer patients, both to support prognosis and to analyze the factors that affect 5-year survival. [Methods] Following the idea of EasyEnsemble, we addressed class imbalance at both the data level and the model level. We then used Bagging to integrate multiple GradientBoosting classifiers and trained the prediction model on the imbalanced gastric cancer survival data. Finally, we used SHAP values to identify the factors affecting 5-year survival. [Results] The model reached an accuracy of 0.808 and an AUC of 0.883, and its accuracy on the minority class (surviving patients) was 0.835. Among the compared models, it achieved the highest AUC, specificity, G-mean, and CK. Regional nodes positive, summary stage, grade, and age had the highest SHAP values. [Limitations] The SEER database offers only a limited set of prognostic factors, which constrained the model's performance. [Conclusions] The new model can effectively predict 5-year survival for gastric cancer patients and identify the factors that influence their 5-year survival probability.
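
The data-level and model-level steps described in the Methods can be illustrated with a minimal sketch. It shows the general EasyEnsemble idea only (several balanced subsets drawn by undersampling the majority class, then averaged predictions), not the authors' exact procedure. The function names `balanced_subsets` and `average_vote`, the toy labels, and the example probabilities are hypothetical; in the paper's setting each subset would train a GradientBoosting classifier, which is omitted here.

```python
import random
from collections import Counter


def balanced_subsets(labels, n_subsets, seed=0):
    """Data layer: return n_subsets index lists. Each list keeps every
    minority-class sample plus an equally sized random draw (without
    replacement) from the majority class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    minority_label = min(counts, key=counts.get)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    return [minority + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]


def average_vote(probabilities, threshold=0.5):
    """Model layer: average the positive-class probabilities produced by
    the base learners and threshold the mean (soft voting)."""
    mean = sum(probabilities) / len(probabilities)
    return int(mean >= threshold), mean


# Toy labels with roughly a 1:4 alive/dead ratio (an illustrative assumption).
labels = [1] * 20 + [0] * 80
subsets = balanced_subsets(labels, n_subsets=5, seed=42)
for idx in subsets:
    print(Counter(labels[i] for i in idx))

# Hypothetical positive-class probabilities from three base learners.
print(average_vote([0.62, 0.41, 0.55]))
```

Each subset is balanced by construction, so every base learner sees all minority samples while the ensemble as a whole still covers much of the majority class.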

Keywords: Survival Prediction; Ensemble Learning; Data Imbalance; Gastric Cancer; Interpretability
Received: 15 January 2021      Published: 14 April 2021
Chinese Library Classification (ZTFLH): R730 G350
Fund: National Natural Science Foundation of China (71771034); Fundamental Research Funds for the Central Universities (DUT21YG108)
Corresponding Author: Guo Chonghui, ORCID: 0000-0002-5155-1297

Cite this article:

Xu Liangchen, Guo Chonghui. Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning. Data Analysis and Knowledge Discovery, 2021, 5(8): 86-99.


The Research Process of Gastric Cancer Survival Prediction
5-year Survival Prediction Model for Gastric Cancer
Actual class    Predicted positive     Predicted negative
Positive        TP (true positive)     FN (false negative)
Negative        FP (false positive)    TN (true negative)
Confusion Matrix of Binary Classification Results
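
The evaluation metrics reported in the comparison table can be derived from the four confusion-matrix cells. The sketch below uses the usual textbook definitions and assumes that "CK" stands for Cohen's kappa; the paper's own formulas are not reproduced on this page, so its reported values (G-mean in particular) may follow a different definition. The function name `binary_metrics` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute common binary-classification metrics from confusion-matrix
    counts, using textbook definitions (an assumption, see the lead-in)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    g_mean = math.sqrt(sensitivity * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "kappa": kappa,
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, fn=10, fp=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it needs the ranked prediction scores rather than a single confusion matrix.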
Schematic Representation of a SHAP Model
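
SHAP attributes a prediction to individual features using Shapley values. As a rough illustration of that underlying idea, the sketch below computes exact Shapley values by brute force over all feature orderings. It does not use the SHAP library, which relies on far more efficient algorithms, and `toy_model`, its coefficients, and the example inputs are all hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings. Features not yet added keep their baseline value.
    The cost grows factorially, so this suits only a handful of features."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]
            value = model(current)
            phi[i] += value - previous
            previous = value
    return [p / len(orders) for p in phi]


def toy_model(features):
    """Hypothetical survival score with one interaction term."""
    nodes_positive, stage, age = features
    return (0.9 - 0.05 * nodes_positive - 0.1 * stage - 0.002 * age
            - 0.01 * nodes_positive * stage)


x = [3, 2, 60]          # made-up patient
baseline = [0, 0, 50]   # made-up reference point
phi = shapley_values(toy_model, x, baseline)
for name, value in zip(["nodes_positive", "stage", "age"], phi):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the prediction for `x` and the prediction for `baseline`, which is the additivity property that SHAP explanations rely on.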
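Code    Site
C160    Cardia, NOS
C161    Fundus of stomach
C162    Body of stomach
C163    Gastric antrum
C164    Pylorus
C165    Lesser curvature of stomach, NOS
C166    Greater curvature of stomach, NOS
C168    Overlapping lesion of stomach
C169    Stomach, NOS
Gastric Cancer Primary Site Codes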
Data type    Variable                          SEER field
Categorical  Sex                               Sex
             Race                              Race recode (W, B, AI, API)
             Region                            State-county
             Marital status                    Marital status at diagnosis
             Primary site                      Primary Site
             Histologic type                   Histologic Type ICD-O-3
             Stage                             Summary stage 2000 (1998+)
             Histologic grade                  Grade
             Laterality                        Laterality
             Radiation record                  Radiation recode
             Chemotherapy record               Chemotherapy recode
Continuous   Age at diagnosis                  Age at diagnosis
             Number of positive lymph nodes    Regional nodes positive
             Year of diagnosis                 Year of diagnosis
Related Features of Gastric Cancer Survival Analysis
Continuous Variable Data Description
Variable                Value                              SEER value                            Count    Code
Sex                     Male                               Male                                  36,452   1
                        Female                             Female                                21,676   0
Race                    White                              White                                 41,406   1
                        Black                              Black                                 7,750    2
                        American Indian/Alaska Native      American Indian/Alaska Native         8,545    3
                        Asian or Pacific Islander          Asian or Pacific Islander             427      4
Marital status          Single                             Single (never married)                7,306    1
                        Married                            Married (including common law)        34,662   2
                        Separated                          Separated                             566      3
                        Divorced                           Divorced                              4,419    4
                        Widowed                            Widowed                               11,163   5
                        Unmarried or domestic partner      Unmarried or Domestic Partner         12       6
Histologic grade        Grade I                            Grade I                               3,367    1
                        Grade II                           Grade II                              15,577   2
                        Grade III                          Grade III                             37,457   3
                        Grade IV                           Grade IV                              1,727    4
Stage                   Localized (no spread)              Localized                             13,712   1
                        Regional (lymph node metastasis)   Regional                              20,434   2
                        Distant (metastasis)               Distant                               18,957   7
                        Unknown                            Unstaged                              5,025    9
Radiation record        None/unknown                       None/Unknown                          46,013   0
                        Beam radiation                     Beam radiation                        13,531   1
                        Radioactive implants               Radioactive implants                  20       2
                        Radioisotopes                      Radioisotopes                         8        3
                        Combination                        Combination                           31       4
                        Radiation, unspecified             NOS method or source not specified    294      5
                        Refused                            Refused                               692      7
                        Recommended                        Recommended                           539      8
Chemotherapy record     None/unknown                       No/Unknown                            34,745   0
                        Chemotherapy                       Yes                                   23,383   1
5-year survival status  Alive                              Alive                                 11,657   1
                        Dead                               Dead                                  46,471   0
Partial Categorical Variable Data Description
learning_rate 0.1
max_depth 4
min_samples_split 4
n_estimators 100
Some Parameters of GBDT
Variation of Evaluation Metrics with the Number of Base Classifiers
Type                           Algorithm               Accuracy  AUC    Specificity  G-mean  CK
Single model                   DT                      0.780     0.667  0.472        0.406   0.329
                               LR                      0.825     0.835  0.274        0.265   0.310
                               KNN                     0.823     0.840  0.352        0.333   0.352
                               ANN                     0.836     0.852  0.380        0.362   0.400
Ensemble model                 RF                      0.844     0.877  0.460        0.434   0.457
                               AdaBoost                0.843     0.876  0.463        0.436   0.457
                               GBDT                    0.848     0.880  0.480        0.452   0.476
Imbalance handling + ensemble  SMOTETomek + RF         0.827     0.867  0.666        0.578   0.502
                               SMOTETomek + AdaBoost   0.805     0.860  0.708        0.587   0.474
                               SMOTETomek + GBDT       0.815     0.868  0.727        0.609   0.498
                               BalancedRandomForest    0.759     0.851  0.819        0.609   0.432
                               EasyEnsemble            0.787     0.877  0.819        0.638   0.478
                               This paper              0.808     0.883  0.835        0.650   0.528
Performance Comparison of Algorithms on SEER Gastric Cancer Dataset
Importance of Factors Affecting the 5-year Survival Status of Gastric Cancer Patients
Distribution of SHAP Values of Factors Influencing the 5-year Survival Status of Gastric Cancer Patients
SHAP Value Distributions for Selected Features
Example of a Gastric Cancer Patient Whose 5-year Survival Status Is Predicted as “Survival”
Example of a Gastric Cancer Patient Whose 5-year Survival Status Is Predicted as “Death”