Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 88-93    DOI: 10.11925/infotech.2096-3467.2019.0021
Predicting Breast Cancer Survival Length with Multi-Omics Data Fusion
Huiying Qi1, Yuhe Jiang2
1School of Health Humanities, Peking University, Beijing 100191, China
2Health Science Center, Peking University, Beijing 100191, China
Abstract  

[Objective] This paper proposes a machine learning model that fuses several types of omics data to better predict the survival length of breast cancer patients. [Methods] The prediction model was built with the random forest algorithm. It merged four types of omics data for breast cancer cases from the TCGA database: gene expression, copy number variation, DNA methylation, and protein expression. [Results] On the test data set, the model's precision reached 97.22% and its recall reached 98.13%. Compared with existing models, the proposed model achieved the highest AUC (0.8393). [Limitations] The sample size needs to be expanded. [Conclusions] The proposed method is an effective way to predict the survival length of breast cancer patients.

Key words: Omics Data Fusion; Random Forest; Breast Cancer Survival Prediction
Received: 07 January 2019      Published: 29 September 2019
ZTFLH:  TP391 G35  
Corresponding Authors: Huiying Qi     E-mail: qhy@bjmu.edu.cn

Cite this article:

Huiying Qi,Yuhe Jiang. Predicting Breast Cancer Survival Length with Multi-Omics Data Fusion. Data Analysis and Knowledge Discovery, 2019, 3(8): 88-93.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0021     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I8/88

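Data type | Count | Description
Clinical data | 1,098 | De-identified clinical and demographic data, including basic patient information, diagnosis and treatment, TNM staging, tumor medical records, and survival status. These data are stored in XML and Biotab formats.
Gene expression data | 1,092 | During the life of a cell, the genetic information stored in the DNA sequence is transcribed and translated into biologically active protein molecules. Studying expression patterns helps in diagnosing cancer.
Protein expression data | 1,098 | Protein expression shows marked differences in the onset and prognosis of cancer.
Copy number variation data | 1,098 | A submicroscopic structural variation of the genome that plays an important role in tumor genetic variation.
Methylation data | 1,095 | Changes in DNA methylation can dysregulate gene expression. When tumor suppressor genes are abnormally methylated, the resulting dysregulation lets cancer cells proliferate unchecked, metastasize, and spread.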
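Data type | Number of original features | Number of optimal features
Copy number variation | 24,776 | 20
Protein expression | 215 | 50
Gene expression | 15,972 | 35
DNA methylation | 16,474 | 30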
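The feature counts above suggest a select-then-concatenate fusion: each omics type is reduced to its best features, and the reduced blocks are joined into one vector per sample. The following is a minimal sketch of that idea, not the authors' implementation. The scoring function `class_mean_gap`, the helper names, and the binary 0/1 labels are assumptions made for illustration; the page does not state the selection criterion that was used.

```python
from statistics import mean

# Optimal feature counts per omics type, taken from the table above.
OPTIMAL_K = {
    "copy_number_variation": 20,
    "protein_expression": 50,
    "gene_expression": 35,
    "dna_methylation": 30,
}


def class_mean_gap(column, labels):
    """Hypothetical score: absolute gap between the two class means of a feature."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def select_top_k(matrix, labels, k):
    """Return the column indices of the k highest-scoring features."""
    n_features = len(matrix[0])
    scores = [
        class_mean_gap([row[j] for row in matrix], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fuse(blocks, labels, k_per_block):
    """Select features within each omics block, then concatenate per sample."""
    selected = {
        name: select_top_k(matrix, labels, k_per_block[name])
        for name, matrix in blocks.items()
    }
    fused = []
    for i in range(len(labels)):
        row = []
        for name, matrix in blocks.items():
            row.extend(matrix[i][j] for j in selected[name])
        fused.append(row)
    return fused, selected


# Width of the fused feature vector implied by the table: 20 + 50 + 35 + 30.
fused_width = sum(OPTIMAL_K.values())
print(fused_width)  # 135

# Toy data (made up) with two small blocks and four samples.
toy_blocks = {
    "a": [[1, 0, 5], [2, 0, 6], [9, 0, 5], [10, 0, 6]],
    "b": [[0, 1], [0, 2], [0, 9], [0, 8]],
}
toy_labels = [0, 0, 1, 1]
toy_fused, toy_selected = fuse(toy_blocks, toy_labels, {"a": 1, "b": 1})
```

Under this scheme the fused vector would have 135 features per sample, which is small compared with the tens of thousands of original features in most blocks.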
Actual value | Predicted value | Prediction correct? | Outcome
Positive | Positive | TRUE | Positive (TP)
Negative | Positive | FALSE | Positive (FP)
Positive | Negative | FALSE | Negative (FN)
Negative | Negative | TRUE | Negative (TN)
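The four outcomes in the table above can be tallied directly from paired lists of actual and predicted labels. This is a small illustrative sketch; the function name `confusion_counts` and the string labels are assumptions chosen to mirror the table.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="Positive"):
    """Count TP, FP, TN and FN from paired actual and predicted labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the sample is actually positive.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: wrong if the sample is actually positive.
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}


# One sample per row of the table above.
actual = ["Positive", "Negative", "Positive", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Negative"]
table_counts = confusion_counts(actual, predicted)
print(table_counts)  # {'TP': 1, 'FP': 1, 'TN': 1, 'FN': 1}
```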
TP | FP | TN | FN | Precision | Recall | F1 score
105 | 3 | 105 | 2 | 0.9722 | 0.9813 | 0.9767
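The reported precision, recall and F1 score follow from the four counts by the standard definitions. The sketch below recomputes them as a consistency check; the function name `precision_recall_f1` is an illustrative choice.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the table above (TN is not needed for these three metrics).
precision, recall, f1 = precision_recall_f1(tp=105, fp=3, fn=2)
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.9722 0.9813 0.9767
```

The recomputed values match the table: 105/108 for precision, 105/107 for recall, and 210/215 for F1.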
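Omics data | AUC
Copy number variation + protein expression + gene expression + DNA methylation | 0.8393
Copy number variation + protein expression + gene expression | 0.8174
Copy number variation + protein expression + DNA methylation | 0.8066
Copy number variation + gene expression + DNA methylation | 0.7913
Protein expression + gene expression + DNA methylation | 0.8303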
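The table above can be read as a leave-one-out ablation: each three-type row omits exactly one omics type, so the drop from the four-type AUC hints at how much that type contributes. The sketch below first shows one common way to compute an AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half) and then ranks the drops. The function names are illustrative, and because the table gives a single AUC per combination with no variance, the ranking should be treated as indicative only.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# AUC values from the table above, keyed by the omics type that was left out.
FULL_AUC = 0.8393
AUC_WITHOUT = {
    "DNA methylation": 0.8174,
    "gene expression": 0.8066,
    "protein expression": 0.7913,
    "copy number variation": 0.8303,
}


def auc_drops(full, without):
    """Rank omics types by the AUC lost when each one is removed."""
    drops = {name: round(full - value, 4) for name, value in without.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


ranking = auc_drops(FULL_AUC, AUC_WITHOUT)
for name, drop in ranking:
    print(f"{name}: {drop:.4f}")

# Toy check of the AUC function on made-up scores.
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

On these numbers, removing protein expression costs the most AUC (0.0480) and removing copy number variation the least (0.0090).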