[Objective] This paper proposes a machine learning model that integrates multiple types of omics data to better predict the survival length of breast cancer patients. [Methods] The prediction model was built with the random forest algorithm. It merged four types of omics data for breast cancer cases from the TCGA database: gene expression, copy number variation, DNA methylation, and protein expression. [Results] On the test data set, the model’s precision reached 97.22% and its recall reached 98.13%. Compared with existing models, the proposed model achieved the highest AUC value (0.8393). [Limitations] The sample size needs to be expanded. [Conclusions] The proposed method is an effective way to predict breast cancer patients’ survival length.
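The three metrics reported in the abstract (precision, recall, and AUC) can be illustrated with a minimal sketch. The labels, scores, and the 0.5 decision threshold below are made-up toy values, not data from the paper, and the function names are hypothetical; in practice the scores would come from something like a classifier's predicted probability for the positive class.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values, for illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.2, 0.7, 0.6, 0.55]
threshold = 0.5  # assumed decision threshold
y_pred = [1 if s >= threshold else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.4f} recall={recall:.4f} auc={auc:.4f}")
```

Precision and recall depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why a model can report precision and recall near 98% alongside a noticeably lower AUC.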
Huiying Qi, Yuhe Jiang. Predicting Breast Cancer Survival Length with Multi-Omics Data Fusion [J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 88-93.