Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 86-99     https://doi.org/10.11925/infotech.2096-3467.2021.0045
Research Paper
Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning*
Xu Liangchen, Guo Chonghui
Institute of Systems Engineering, Dalian University of Technology, Dalian 116024, China
Abstract

[Objective] Using the SEER database, this paper builds a model that predicts 5-year survival for gastric cancer patients. The aim is to improve discriminative performance, particularly for surviving patients, and to analyze the factors affecting 5-year survival in support of prognosis evaluation. [Methods] Drawing on ensemble learning and the idea of EasyEnsemble, we handle data imbalance by combining data-level and model-level techniques: multiple Gradient Boosting classifiers are integrated by Bagging to build a prediction model on the imbalanced gastric cancer survival data. SHAP values are then used to interpret the factors affecting 5-year survival. [Results] The model reached an accuracy of 0.808 and an AUC of 0.883, and its prediction accuracy for surviving patients, the minority class, was 0.835. It predicted the 5-year survival status of gastric cancer patients better than the other models compared. The number of positive regional lymph nodes, tumor stage and grade, and age had relatively high SHAP values. [Limitations] The SEER database records only a limited set of prognostic factors, which constrains model performance to some extent and affects the predictions. [Conclusions] The model performs well and also discriminates the minority class of surviving patients well. The number of positive lymph nodes, tumor stage and grade, and age have an important influence on the 5-year survival probability of gastric cancer patients, which is consistent with clinical experience.

Key words: Survival Prediction    Ensemble Learning    Data Imbalance    Gastric Cancer    Interpretability
Received: 2021-01-15      Published: 2021-04-14
Chinese Library Classification (ZTFLH):  R730 G350
Funding: *National Natural Science Foundation of China (71771034); Fundamental Research Funds for the Central Universities (DUT21YG108)
Corresponding author: Guo Chonghui ORCID:0000-0002-5155-1297     E-mail: dlutguo@dlut.edu.cn
Cite this article:
Xu Liangchen, Guo Chonghui. Predicting Survival Rates for Gastric Cancer Based on Ensemble Learning. Data Analysis and Knowledge Discovery, 2021, 5(8): 86-99.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0045      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I8/86
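Fig. 1  Research process for gastric cancer survival prediction in this paper
Fig. 2  Framework of the 5-year gastric cancer survival prediction model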
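| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive | TP (true positive) | FN (false negative) |
| Negative | FP (false positive) | TN (true negative) |

Table 1  Confusion matrix for binary classification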
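The metrics reported later in Table 6 can all be derived from the four cells of Table 1. The sketch below uses the standard textbook definitions of accuracy, sensitivity, specificity, G-mean and Cohen's kappa. It assumes that "CK" in Table 6 denotes Cohen's kappa, and the paper's exact formulas are not reproduced on this page, so the function name `binary_metrics` and the definitions used are illustrative rather than the authors' own.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from the confusion-matrix cells of Table 1."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    # Textbook G-mean: geometric mean of the two per-class recalls.
    g_mean = math.sqrt(sensitivity * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": g_mean,
        "kappa": kappa,
    }


# Made-up counts, for illustration only (not data from the paper).
metrics = binary_metrics(tp=80, fn=20, fp=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look good on imbalanced data even when one class is poorly recognised, which is why per-class recall, G-mean and kappa are reported alongside it.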
Fig. 3  Schematic of the SHAP model
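SHAP attributions are built on Shapley values from cooperative game theory. As a minimal illustration of that underlying idea, the sketch below computes exact Shapley values by brute force for a three-feature toy function. The toy model, the all-zero baseline and the names `shapley_values` and `value_fn` are assumptions made for this example; they are not the paper's model, and a real analysis would normally rely on a dedicated SHAP implementation rather than enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Toy model f(x) = 2*x0 + 3*x1 + x0*x2, explained at x = (1, 1, 1)
# against an all-zero baseline (both chosen only for illustration).
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)


def toy_model(v):
    return 2 * v[0] + 3 * v[1] + v[0] * v[2]


def value_fn(coalition):
    # Features in the coalition take their real value, the rest the baseline.
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(mixed)


phi = shapley_values(value_fn, n_features=3)
print(phi)
```

The attributions sum to the difference between the model output at `x` and at the baseline, which is the additivity property that SHAP-style explanation plots rely on. The interaction term `x0*x2` is split evenly between features 0 and 2.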
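| Code | Site |
|---|---|
| C160 | Cardia, NOS |
| C161 | Fundus of stomach |
| C162 | Body of stomach |
| C163 | Gastric antrum |
| C164 | Pylorus |
| C165 | Lesser curvature of stomach, NOS |
| C166 | Greater curvature of stomach, NOS |
| C168 | Overlapping lesion of stomach |
| C169 | Stomach, NOS |

Table 2  Gastric cancer primary site codes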
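| Data type | Variable | SEER field |
|---|---|---|
| Categorical | Sex | Sex |
| | Race | Race recode (W, B, AI, API) |
| | Region | State-county |
| | Marital status | Marital status at diagnosis |
| | Primary site | Primary Site |
| | Histologic type | Histologic Type ICD-O-3 |
| | Stage | Summary stage 2000 (1998+) |
| | Histologic grade | Grade |
| | Laterality | Laterality |
| | Radiation record | Radiation recode |
| | Chemotherapy record | Chemotherapy recode |
| Continuous | Age at diagnosis | Age at diagnosis |
| | Number of positive lymph nodes | Regional nodes positive |
| | Year of diagnosis | Year of diagnosis |

Table 3  Features relevant to gastric cancer survival analysis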
Fig. 4  Description of the continuous variables
| Variable | Value (as recorded in SEER) | Count | Code |
|---|---|---|---|
| Sex | Male | 36,452 | 1 |
| | Female | 21,676 | 0 |
| Race | White | 41,406 | 1 |
| | Black | 7,750 | 2 |
| | American Indian/Alaska Native | 8,545 | 3 |
| | Asian or Pacific Islander | 427 | 4 |
| Marital status | Single (never married) | 7,306 | 1 |
| | Married (including common law) | 34,662 | 2 |
| | Separated | 566 | 3 |
| | Divorced | 4,419 | 4 |
| | Widowed | 11,163 | 5 |
| | Unmarried or Domestic Partner | 12 | 6 |
| Histologic grade | Grade I | 3,367 | 1 |
| | Grade II | 15,577 | 2 |
| | Grade III | 37,457 | 3 |
| | Grade IV | 1,727 | 4 |
| Stage | Localized (no spread) | 13,712 | 1 |
| | Regional (lymph node involvement) | 20,434 | 2 |
| | Distant (metastasis) | 18,957 | 7 |
| | Unstaged (unknown) | 5,025 | 9 |
| Radiation record | None/Unknown | 46,013 | 0 |
| | Beam radiation | 13,531 | 1 |
| | Radioactive implants | 20 | 2 |
| | Radioisotopes | 8 | 3 |
| | Combination | 31 | 4 |
| | NOS method or source not specified | 294 | 5 |
| | Refused | 692 | 7 |
| | Recommended | 539 | 8 |
| Chemotherapy record | No/Unknown | 34,745 | 0 |
| | Yes | 23,383 | 1 |
| 5-year survival status | Alive | 11,657 | 1 |
| | Dead | 46,471 | 0 |

Table 4  Description of selected categorical variables
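As a small worked example of how Table 4 might be used, the sketch below maps a few SEER text values to the integer codes listed in the table and computes the class imbalance from the 5-year survival counts. The dictionary names, the `encode_record` helper and the record layout are hypothetical; only the codes and counts are taken from Table 4.

```python
# Codes and counts copied from Table 4; all names here are illustrative.
STAGE_CODES = {"Localized": 1, "Regional": 2, "Distant": 7, "Unstaged": 9}
CHEMO_CODES = {"No/Unknown": 0, "Yes": 1}
SURVIVAL_CODES = {"Alive": 1, "Dead": 0}
SURVIVAL_COUNTS = {"Alive": 11657, "Dead": 46471}


def encode_record(record):
    """Map one record's SEER text values to the integer codes of Table 4."""
    return {
        "stage": STAGE_CODES[record["stage"]],
        "chemo": CHEMO_CODES[record["chemo"]],
        "survival": SURVIVAL_CODES[record["survival"]],
    }


total = sum(SURVIVAL_COUNTS.values())
minority_share = SURVIVAL_COUNTS["Alive"] / total
imbalance_ratio = SURVIVAL_COUNTS["Dead"] / SURVIVAL_COUNTS["Alive"]

# A made-up record, used only to show the mapping.
example = encode_record({"stage": "Distant", "chemo": "Yes", "survival": "Dead"})
print(example)
print(f"total={total}, minority share={minority_share:.3f}, "
      f"dead:alive ratio={imbalance_ratio:.2f}")
```

Surviving patients make up about one fifth of the records, roughly a 4:1 imbalance, which is the motivation for the imbalance handling described in the abstract.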
| Parameter | Value |
|---|---|
| learning_rate | 0.1 |
| max_depth | 4 |
| min_samples_split | 4 |
| n_estimators | 100 |

Table 5  Selected GBDT parameters
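One way to read the approach described in the abstract (an EasyEnsemble-style combination of undersampling with a Bagging ensemble of Gradient Boosting classifiers) is sketched below with scikit-learn, using the parameter values from Table 5. This is a sketch under stated assumptions, not the authors' implementation: the balanced-subset construction, the probability averaging, the number of subsets, the synthetic data and the helper names `fit_balanced_gbdt_bag` and `predict_proba_bag` are all illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def fit_balanced_gbdt_bag(X, y, n_subsets=5, seed=0):
    """Train one GBDT per balanced subset of a binary-labelled data set.

    Each subset holds every minority sample plus an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    models = []
    for i in range(n_subsets):
        sampled = rng.choice(maj_idx, size=min_idx.size, replace=False)
        idx = np.concatenate([min_idx, sampled])
        clf = GradientBoostingClassifier(
            learning_rate=0.1, max_depth=4, min_samples_split=4,
            n_estimators=100, random_state=seed + i)  # values from Table 5
        models.append(clf.fit(X[idx], y[idx]))
    return models


def predict_proba_bag(models, X):
    """Average the class-1 probability over all base classifiers."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Synthetic imbalanced data (about 20% positives), for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.85).astype(int)

models = fit_balanced_gbdt_bag(X, y)
proba = predict_proba_bag(models, X)
pred = (proba >= 0.5).astype(int)
print("positives in data:", int(y.sum()), "predicted positives:", int(pred.sum()))
```

Because every base classifier sees a balanced subset, the ensemble is not dominated by the majority class, while drawing a different majority sample for each subset lets the ensemble as a whole use more of the majority data than a single undersampled model would.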
Fig. 5  Evaluation metrics for different numbers of base classifiers
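| Type | Algorithm | Accuracy | AUC | Specificity | G-mean | CK |
|---|---|---|---|---|---|---|
| Single model | DT | 0.780 | 0.667 | 0.472 | 0.406 | 0.329 |
| | LR | 0.825 | 0.835 | 0.274 | 0.265 | 0.310 |
| | KNN | 0.823 | 0.840 | 0.352 | 0.333 | 0.352 |
| | ANN | 0.836 | 0.852 | 0.380 | 0.362 | 0.400 |
| Ensemble model | RF | 0.844 | 0.877 | 0.460 | 0.434 | 0.457 |
| | AdaBoost | 0.843 | 0.876 | 0.463 | 0.436 | 0.457 |
| | GBDT | 0.848 | 0.880 | 0.480 | 0.452 | 0.476 |
| Imbalance handling + ensemble model | SMOTETomek + RF | 0.827 | 0.867 | 0.666 | 0.578 | 0.502 |
| | SMOTETomek + AdaBoost | 0.805 | 0.860 | 0.708 | 0.587 | 0.474 |
| | SMOTETomek + GBDT | 0.815 | 0.868 | 0.727 | 0.609 | 0.498 |
| | BalancedRandomForest | 0.759 | 0.851 | 0.819 | 0.609 | 0.432 |
| | EasyEnsemble | 0.787 | 0.877 | 0.819 | 0.638 | 0.478 |
| | This paper | 0.808 | 0.883 | 0.835 | 0.650 | 0.528 |

Table 6  Performance comparison of algorithms on the SEER gastric cancer data set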
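Fig. 6  Importance of factors affecting 5-year survival status in gastric cancer
Fig. 7  Distribution of SHAP values for factors affecting 5-year survival status in gastric cancer
Fig. 8  Distribution of SHAP values for selected features
Fig. 9  Example explanation of a gastric cancer patient whose 5-year survival status is predicted as "alive"
Fig. 10  Example explanation of a gastric cancer patient whose 5-year survival status is predicted as "dead"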