
Data Analysis and Knowledge Discovery, 2018, 2(12): 98-108 https://doi.org/10.11925/infotech.2096-3467.2018.0545

Application Paper

Predicting Antineoplastic Drug Targets Based on Network Properties*

Fan Xinyue, Cui Lei

School of Medical Informatics, China Medical University, Shenyang 110122, China

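CLC number: TP391 G353

Corresponding author: Cui Lei, ORCID: 0000-0001-9479-8225, E-mail: lcui@cmu.edu.cn

Received: 2018-05-15

Revised: 2018-06-04

Published online: 2018-12-25

Copyright: 2018 Editorial Office of Data Analysis and Knowledge Discovery

Funding: *This work is one of the research outcomes of the CERNET Next Generation Internet Technology Innovation Project "Medical Imaging Teaching Platform for Colleges and Universities" (Grant No. NGII20150503).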


Abstract

[Objective] This paper aims to identify potential targets of antineoplastic drugs and to provide a reference for future clinical work and experimental validation. [Methods] We retrieved antineoplastic drug targets from the DrugBank database and combined them with protein interaction information from the HPRD database. We built a PPI network of drug targets with Cytoscape and calculated the topological properties of its nodes. We then screened the topological variables with univariate analysis in SPSS and information gain in Weka, applied the SMOTE algorithm to handle the imbalanced data set, and constructed a prediction model for antineoplastic drug targets with the decision tree method. Finally, we compared its performance with that of three other common machine learning classifiers. [Results] The prediction accuracy of the decision tree model reached 73.18%. Validation in cBioPortal showed that the 16 predicted targets with scores of at least 0.9 are mutated and amplified in a variety of tumors; NR5A1 is analyzed in detail as an example. [Limitations] The model was built only from the PPI network properties of antineoplastic drug targets; functional and sequence features of the targets were not included. [Conclusions] Predicting potential antineoplastic drug targets with machine learning on the topological properties of the PPI network is effective and can provide a reference for antineoplastic drug development and clinical work.

Keywords: PPI Network ; Machine Learning ; Decision Tree ; Antineoplastic Drug Targets Prediction


Fan Xinyue, Cui Lei. Predicting Antineoplastic Drug Targets Based on Network Properties[J]. Data Analysis and Knowledge Discovery, 2018, 2(12): 98-108 https://doi.org/10.11925/infotech.2096-3467.2018.0545

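1 Introduction

In recent years, cancer incidence in China has risen steadily. According to the World Cancer Report published by the World Health Organization (WHO), mainland China has the largest number of newly diagnosed cancer cases in the world, close to half of all new cases globally[1]. In February 2018, the National Cancer Center released its latest national cancer statistics: in 2014 there were an estimated 3.804 million new cancer cases nationwide, with an incidence of 278.07 per 100,000 and a mortality of 167.89 per 100,000. On average, this means that more than 10,000 people per day, or about 7 people per minute, are diagnosed with cancer[2].

Faced with such a severe situation, the prevention and treatment of cancer have become a top priority for the entire medical community. The pathogenesis of cancer is extremely complex. Contributing factors include unhealthy habits (smoking, heavy drinking), environmental pollution, immune deficiency, genetic factors, and endocrine factors (hormone levels). With the completion of genome sequencing and advances in science and technology, it has been found that the root cause of cancer is that these factors all alter the genes of cells to some degree, leading to malignant transformation. Based on this finding, targeted therapy directed at specific mutated genes has gradually become a new breakthrough in cancer treatment.

Studies suggest that roughly 5%-10% of human genes are involved in the onset and progression of cancer[3], yet experimentally validated cancer genes account for only about 1% of the human genome[4], and those used as antineoplastic drug targets are fewer still. Few of the targeted cancer therapies that have emerged internationally are marketed in China. Table 1 lists the targeted drugs commonly used for the five cancers with the highest incidence and mortality in China (lung, gastric, colorectal, liver, and breast cancer).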

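Table 1   Commonly used targeted drugs for lung, gastric, colorectal, liver, and breast cancer

| Target | Generic name | Trade name | Disease | Marketed in China |
| --- | --- | --- | --- | --- |
| EGFR, HER2 | Necitumumab; Osimertinib | Portrazza; Tagrisso | Lung cancer | |
| ALK | Ceritinib | Zykadia | Lung cancer | |
| ALK | Alectinib | Alecensa | Lung cancer | |
| ALK | Brigatinib | Alunbrig | Lung cancer | |
| VEGFR2 | Ramucirumab | Cyramza | Lung, gastric, and colorectal cancer | |
| BRAF | Dabrafenib + Trametinib | Tafinlar + Mekinist | Lung cancer | |
| PD-1 | Nivolumab | Opdivo | Lung, colorectal, and liver cancer | |
| PD-1 | Pembrolizumab | Keytruda | Lung and colorectal cancer | |
| PD-L1 | Atezolizumab | Tecentriq | Lung and gastric cancer | |
| KIT, PDGFR, RAF, RET, VEGFR1/2/3 | Regorafenib | Stivarga | Colorectal and liver cancer | |
| VEGFA/B, PlGF | Ziv-aflibercept | Zaltrap | Colorectal cancer | |
| EGFR, KRAS | Panitumumab | Vectibix | Colorectal cancer | |
| - | Trifluridine | Tipiracil | Colorectal cancer | |
| RTK, VEGF | Lenvatinib | Lenvima | Liver cancer | |
| HER2 | Ado-trastuzumab emtansine (T-DM1) | Kadcyla | Breast cancer | |
| HER2 | Pertuzumab | Perjeta | Breast cancer | |
| HER2 | Neratinib | Nerlynx | Breast cancer | |
| CDK4 | Palbociclib | Ibrance | Breast cancer | |
| CDK6 | Ribociclib | Kisqali | Breast cancer | |
| CDK6 | Abemaciclib | Verzenio | Breast cancer | |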

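As Table 1 shows, the targets of current antineoplastic drugs are concentrated on a very small number of genes. Many other genes may have potential as antineoplastic drug targets but have not yet been discovered. To treat cancer more effectively, discovering more antineoplastic drug targets will undoubtedly become a focus of future drug development research.

2 Related Work

For drug target prediction, traditional experimental methods such as linkage analysis and association studies consume a great deal of time and money, and their results are often controversial[5]. With advances in computing and bioinformatics, it has become feasible to predict drug targets with machine learning and bioinformatics methods. This can accelerate target discovery, substantially shorten the drug development cycle, reduce development costs, and help avoid unpredictable side effects of newly developed drugs[6].

Many studies in China and abroad have used machine learning and bioinformatics methods to predict potential drug targets. Shang Zhenwei et al.[7] obtained drug targets and their protein family information from DrugBank and Pfam and, based on the primary sequence structure of the targets, used a Support Vector Machine (SVM) to predict potential drug targets among G protein-coupled receptors. Xie Qianqian et al.[8] took validated targets in DrugBank as the positive set and a set extracted in previous studies as the negative set, and used encoded features of the target sequences with ensemble learning to predict potential drug targets among ion channels. Cai Lige[9] used the primary structure and physicochemical properties of target sequences as variables and applied SVM to study target prediction on imbalanced data. Carson et al.[10] used the topological properties of the protein-protein interaction (PPI) network as variables, together with the Disease Ontology (DO) and the Gene Reference Into Function (GeneRIF) database, to predict potential targets. Jing et al.[11] improved the artificial neural network algorithm and applied deep learning to predict potential drug targets. Ferrero et al.[12] used five variables provided by the Open Targets database and four algorithms (SVM, random forest, neural network, and gradient boosting) to predict potential drug targets, and validated the predictions by text mining of the literature. Most of these studies use target sequence information as the input variables for machine learning; only a few use the topological properties of the PPI network. In addition, they all address drug targets in general rather than focusing on the targets of one specific drug class.

To address these limitations, this study adopts the hypothesis that targets that interact in the PPI network and have similar network topological properties have similar characteristics[10]. We take antineoplastic drugs as the research object, obtain all antineoplastic drug targets from the DrugBank database, use the topological properties of the targets in the PPI network as variables, and build a prediction model for antineoplastic drug targets with machine learning. We then obtain potential antineoplastic drug targets and validate the predictions, to provide a reference for future antineoplastic drug research.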

3 Prediction Model for Antineoplastic Drug Targets

The overall workflow of this study is shown in Figure 1.

Figure 1   Technical workflow

   

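3.1 Data Collection and Preprocessing

(1) Data download

Antineoplastic drug targets were selected as the research object. In the DrugBank database (Version 5.1.0)[13], antineoplastic drugs fall into four main categories according to therapeutic area:

① Antineoplastic Agents-target;

② Antineoplastic Agents, Phytogenic;

③ Antineoplastic Agents, Hormonal;

④ Antineoplastic Agents, Alkylating.

These categories contain 327 antineoplastic drugs and 615 corresponding targets, which form the positive set of this study. The negative set consists of all other drug targets in DrugBank, covering the four types Target, Enzyme, Transporter, and Carrier, 3,498 in total.

All human protein-protein interaction pairs, 39,240 in total, were downloaded from the HPRD database[14]. Python was used to extract the interactions among the 4,113 drug targets above. Of these targets, 1,739 have interaction records in HPRD (396 antineoplastic drug targets and 1,343 non-antineoplastic drug targets), with 6,134 interactions among them. The data collection process is shown in Figure 2.

Figure 2   Data collection workflow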
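The extraction step described above (keeping only interactions whose two partners are both drug targets) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' script: the function name `filter_interactions`, the tuple representation of interaction pairs, and the toy identifiers are all assumptions made for the example.

```python
def filter_interactions(pairs, targets):
    """Keep interactions whose two partners are both in `targets`.

    pairs   -- iterable of (protein_a, protein_b) tuples (hypothetical input format)
    targets -- set of target identifiers
    Returns (edges, nodes): deduplicated undirected edges and the targets they cover.
    """
    edges = set()
    for a, b in pairs:
        if a in targets and b in targets:
            # Store each undirected edge once, regardless of orientation.
            edges.add(tuple(sorted((a, b))))
    nodes = {p for edge in edges for p in edge}
    return edges, nodes


# Toy data, not taken from HPRD or DrugBank.
toy_pairs = [("AR", "JUN"), ("JUN", "AR"), ("AR", "XYZ1"), ("MAPK1", "JUN")]
toy_targets = {"AR", "JUN", "MAPK1", "NFKB1"}
toy_edges, toy_nodes = filter_interactions(toy_pairs, toy_targets)
```

In the toy data, `NFKB1` appears in no retained edge and therefore drops out of the node set, which mirrors how the 4,113 candidate targets reduce to 1,739 network nodes in the text.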

   

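(2) Building the PPI network and extracting network properties

The interaction data were imported into Cytoscape[15], giving a PPI network with 1,739 nodes and 6,134 edges. The network analysis plugin (Network Analysis) was used to compute the network properties of each node as predictive features, as listed in Table 2.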

Table 2   Network properties of protein targets and their importance ranking

| Network property (predictive feature) | Importance ranking (highest first) |
| --- | --- |
| Average Shortest Path Length | ANR |
| Betweenness Centrality | Average Shortest Path Length |
| Closeness Centrality | Degree |
| Clustering Coefficient | Number Of Directed Edges |
| Degree | Stress |
| Eccentricity | Closeness Centrality |
| Number Of Directed Edges | Eccentricity |
| Number Of Undirected Edges | Clustering Coefficient |
| Partner Of MultiEdgedNodePairs | SelfLoops |
| Radiality | Topological Coefficient |
| SelfLoops | Betweenness Centrality |
| Stress | Radiality |
| Topological Coefficient | |
| ANR | |
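Several of the properties in Table 2 have simple graph-theoretic definitions. The sketch below computes four of them (degree, clustering coefficient, average shortest path length, and closeness centrality taken as the reciprocal of the average shortest path length) in plain Python on a toy graph. It follows common textbook definitions and is not the Cytoscape plugin's implementation; all function names are assumptions made for the example.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj, node):
    """Number of neighbors of `node`."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = list(adj[node] - {node})
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))


def average_shortest_path_length(adj, node):
    """Mean breadth-first-search distance from `node` to every reachable node."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in adj[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    others = [d for n, d in dist.items() if n != node]
    return sum(others) / len(others) if others else 0.0


def closeness_centrality(adj, node):
    """Reciprocal of the average shortest path length (0 for isolated nodes)."""
    aspl = average_shortest_path_length(adj, node)
    return 1.0 / aspl if aspl else 0.0


# Toy network: a triangle A-B-C with a pendant node D attached to C.
toy_adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
```

On the toy graph, node C has degree 3 and only one of its three neighbor pairs is connected, so its clustering coefficient is 1/3.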

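This study introduces a new network property, ANR (Antineoplastic Neighbor Ratio), defined as the proportion of antineoplastic drug targets among the neighbors of a node, as in equation (1).

$ANR_i = \frac{N_{antineoplastic}}{\sum_{j=1}^{n} A_{ij}}$ (1)

Here $N_{antineoplastic}$ is the number of antineoplastic drug targets among the neighbors of node i, A is the adjacency matrix of the network, and n is the number of nodes, so that $\sum_{j=1}^{n} A_{ij}$ equals the degree of node i, that is, the total number of its neighbors. The ANR of each node was computed in Python.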
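Equation (1) translates directly into code. The following is a minimal sketch under the assumption that the network is stored as an adjacency map of neighbor sets; the function name, the data layout, and the toy labels are illustrative and do not come from the paper.

```python
def antineoplastic_neighbor_ratio(adj, positives):
    """Return {node: ANR}, the share of a node's neighbors that are known
    antineoplastic drug targets, following equation (1).

    adj       -- {node: set(neighbors)} for an undirected network
    positives -- set of nodes labeled as antineoplastic drug targets
    """
    anr = {}
    for node, neighbors in adj.items():
        deg = len(neighbors)
        # Nodes without neighbors get 0.0 to avoid division by zero (an assumption).
        anr[node] = len(neighbors & positives) / deg if deg else 0.0
    return anr


# Toy network; node names and labels are illustrative only.
toy_adj = {
    "P1": {"P2", "P3", "P4", "P5"},
    "P2": {"P1"},
    "P3": {"P1"},
    "P4": {"P1"},
    "P5": {"P1"},
}
toy_positives = {"P1", "P2", "P3"}
toy_anr = antineoplastic_neighbor_ratio(toy_adj, toy_positives)
```

As a worked case from the paper itself: NR5A1 is reported in Section 4 with Degree=8 and five antineoplastic neighbors, so equation (1) gives an ANR of 5/8 = 0.625.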

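(3) Variable selection

Based on the known information, the drug targets were divided into two classes:

① class 0: drug targets in DrugBank other than antineoplastic drug targets;

② class 1: antineoplastic drug targets in DrugBank.

To determine whether each variable independently affects the target class, univariate analysis (independent-samples t-test or chi-square test) was performed in SPSS 22.0. Two variables, Partner Of MultiEdgedNodePairs and Number Of Undirected Edges, were removed. The remaining 12 variables were loaded into Weka[16], and their importance was ranked with Attribute Selection, using InfoGainAttributeEval as the attribute evaluator and Ranker as the search method (see the second column of Table 2). Based on the result, all 12 variables were used to train the prediction model.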
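Ranking attributes by information gain means scoring each attribute by how much it reduces the entropy of the class label. The sketch below shows that idea for attributes that are already discretized. It is a simplified illustration and not Weka's code; the function names and the toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for one discrete attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(table, labels):
    """Return attribute names sorted by decreasing information gain."""
    gains = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy, already-discretized data (values are invented for illustration).
toy_labels = [1, 1, 0, 0]
toy_table = {
    "anr_bin": ["high", "high", "low", "low"],     # separates the classes perfectly
    "degree_bin": ["high", "low", "high", "low"],  # carries no class information
}
toy_ranking = rank_attributes(toy_table, toy_labels)
```

In the toy data the first attribute removes all uncertainty about the class (gain of 1 bit) and the second removes none, so it ranks last.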

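3.2 Constructing the Sample Sets

(1) Splitting into training and test sets

The study has 1,739 samples in total: 396 positive samples (antineoplastic drug targets) and 1,343 negative samples (non-antineoplastic drug targets). Using ten-fold cross-validation, the samples were divided at a ratio of 7:3 into a training set (1,217 samples) and a test set (522 samples).

(2) Handling class imbalance in the training set with SMOTE

The 1,217 training samples comprise 281 positive and 936 negative samples, a negative-to-positive ratio of roughly 3:1. Studies have shown that when the ratio of negative to positive samples in the training set is close to 1:1, the bias toward the majority class can be effectively avoided[17], which yields better predictions. This study therefore applies SMOTE to the training set.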

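The Synthetic Minority Oversampling Technique (SMOTE)[18] is an improvement on random oversampling. Its basic idea is to analyze the minority-class samples and synthesize new samples from them, which are then added to the data set. The procedure is as follows:

① For each sample x in the minority class, compute its Euclidean distance to all other samples in the minority class and obtain its k nearest neighbors;

② Set a sampling ratio according to the class imbalance to determine the sampling rate N. For each minority-class sample x, randomly select several samples from its k nearest neighbors; denote a selected neighbor by $x_n$;

③ For each randomly selected neighbor $x_n$, construct a new sample from it and the original sample x according to equation (2).

$x_{new} = x + rand(0,1) \times (x_n - x)$ (2)

This method balances the negative and positive samples (936 of each) and helps avoid the overfitting caused by random oversampling.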
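The three steps above can be sketched as follows. This is a simplified illustration of the interpolation idea, not the implementation used in the study: base samples are drawn at random until the requested number of synthetic samples is reached, and the function name, parameters, and toy data are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic minority samples by interpolation.

    minority -- list of feature vectors (lists of floats) from the minority class
    n_new    -- number of synthetic samples to create
    k        -- number of nearest neighbors considered for each sample
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by Euclidean distance (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        xn = rng.choice(neighbors)
        gap = rng.random()  # plays the role of rand(0, 1)
        # The new point lies on the segment between x and its neighbor xn.
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, xn)])
    return synthetic


# Toy minority class in two dimensions (invented values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote(toy_minority, n_new=6, k=2)
```

With the counts given in the text, matching 936 negative training samples from 281 positive ones would require 936 - 281 = 655 synthetic positive samples.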

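3.3 Model Construction and Prediction Results

This study uses the widely adopted C4.5 decision tree algorithm, which selects the splitting attribute by information gain ratio. For an attribute A, the split information is given by equation (3)[19].

$SplitInfo_A(S) = -\sum_{j=1}^{m} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}$ (3)

The training set S is partitioned into m subsets by the values of attribute A; $|S_j|$ is the number of samples in the j-th subset and $|S|$ is the total number of samples before the split.

The information gain ratio of the sample set after splitting on attribute A is given by equation (4)[20].

$InfoGainRatio(S,A) = \frac{InfoGain(S,A)}{SplitInfo_A(S)}$ (4)

The computation is applied recursively, and at each step the attribute with the largest information gain ratio is chosen as the splitting attribute.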
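Equations (3) and (4) can be checked on a small example. The sketch below computes the split information and the gain ratio for one discrete attribute; it illustrates the formulas only and is not a C4.5 implementation, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def split_info(values):
    """Equation (3): entropy of the partition sizes induced by the attribute."""
    return entropy(values)


def gain_ratio(values, labels):
    """Equation (4): information gain divided by split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(
        len(g) / total * entropy(g) for g in groups.values()
    )
    si = split_info(values)
    # An attribute with a single value splits nothing; return 0 by convention here.
    return info_gain / si if si > 0 else 0.0


# Toy example: 6 samples, and the attribute takes three values.
toy_values = ["a", "a", "b", "b", "c", "c"]
toy_labels = [1, 1, 0, 0, 1, 0]
toy_ratio = gain_ratio(toy_values, toy_labels)
```

Here the three subsets have equal size, so the split information is $\log_2 3$; two subsets are pure and one is evenly mixed, so the information gain is 2/3 and the gain ratio is $(2/3)/\log_2 3$.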

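Python was used to discretize the continuous variables on the training set, and the decision tree model was then trained with ten-fold cross-validation. Of the 522 test samples, 382 were predicted correctly, an accuracy of 73.18%.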

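3.4 Comparison Experiments

For comparison, models were also built with a Bayesian network, a support vector machine, and an artificial neural network, and their predictions were compared with those of the C4.5 decision tree. In Weka, the corresponding implementations BayesNet, SMO, and Multilayer Perceptron were used with all parameters at their defaults. Table 3 compares the four machine learning algorithms on five metrics: Precision, Recall, F-measure, AUC, and AUPR.

Table 3   Comparison of the prediction results of the four classification algorithms

| Algorithm | Precision | Recall | F-measure | AUC | AUPR |
| --- | --- | --- | --- | --- | --- |
| C4.5 decision tree | 0.773 | 0.732 | 0.747 | 0.754 | 0.797 |
| Artificial neural network | 0.784 | 0.745 | 0.759 | 0.753 | 0.796 |
| Bayesian network | 0.758 | 0.780 | 0.764 | 0.752 | 0.795 |
| Support vector machine | 0.784 | 0.743 | 0.757 | 0.701 | 0.748 |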

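4 Analysis of Results

With the C4.5 decision tree, 382 of the 522 test samples were predicted correctly, and 94 targets that are labeled negative were predicted as positive. This indirectly suggests that these targets resemble antineoplastic drug targets in some attributes and may themselves have the potential to become antineoplastic drug targets[10]. Following this idea, the 16 targets with a prediction score of at least 0.9 (Score≥0.9) were selected; they are summarized in Figure 3. For these misclassified targets, under the hypothesis that targets with similar network properties may have similar functions, mutations and copy number changes of their neighbor genes in tumor tissue can offer clues to the function of these nodes. The cBioPortal database[21] was therefore used to validate the results.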
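The reported counts are enough to reconstruct a plausible confusion matrix for the C4.5 model, which in turn makes the Table 3 metrics easier to read. The reconstruction is an inference, not something the paper states: subtracting the training counts in Section 3.2 from the class totals gives a test set of 396 - 281 = 115 positives and 1,343 - 936 = 407 negatives; 94 false positives then leave 313 true negatives; and the 382 correct predictions reported in Section 3.3 imply 69 true positives and 46 false negatives. The sketch below computes support-weighted precision, recall, and F-measure from those assumed counts; the function name is hypothetical.

```python
def weighted_metrics(tn, fp, fn, tp):
    """Support-weighted precision, recall and F-measure for a binary problem.

    tn, fp, fn, tp -- confusion-matrix counts with class 1 as the positive class
    """
    n_neg, n_pos = tn + fp, fn + tp
    total = n_neg + n_pos

    def prf(correct, predicted, actual):
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        denom = precision + recall
        f = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f

    per_class = [
        (n_neg, prf(tn, tn + fn, n_neg)),  # class 0 (non-antineoplastic)
        (n_pos, prf(tp, tp + fp, n_pos)),  # class 1 (antineoplastic)
    ]
    # Weight each class's score by its share of the samples.
    return tuple(
        sum(support * scores[i] for support, scores in per_class) / total
        for i in range(3)
    )


# Counts inferred from the text (an assumption, not a reported matrix).
precision, recall, f_measure = weighted_metrics(tn=313, fp=94, fn=46, tp=69)
```

Under these assumed counts the weighted precision, recall, and F-measure round to 0.773, 0.732, and 0.747, the same values as the C4.5 row of Table 3. The weighted recall is simply the overall accuracy, 382/522.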

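Figure 3   Summary of targets with Score≥0.9

cBioPortal (cBio Cancer Genomics Portal, http://cbioportal.org/) aggregates data from several large cancer genome projects, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC). The integrated data types include somatic mutations, DNA copy number alterations, mRNA and microRNA expression, DNA methylation, protein abundance, and phosphoprotein abundance, and the portal provides visualizations of them. It has been called "the terminator of cancer genomics".

The 16 targets in Figure 3 were searched in cBioPortal. For each target, the tumor type with the highest mutation rate and the tumor type with the highest amplification rate are summarized, together with the rates, in Table 4.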

Table 4   Mutation and amplification of drug targets with Score≥0.9 in cancer tissues

| Gene | Protein | Mutation | Amplification |
| --- | --- | --- | --- |
| NR5A1 | Steroidogenic factor 1 | Cutaneous Melanoma (3.14%) | Prostate Cancer, NOS (16.92%) |
| CSF3R | Granulocyte colony-stimulating factor receptor | Penile Cancer (14.29%) | Ovarian Cancer (5.71%) |
| NFKB2 | Nuclear factor NF-kappa-B p100 subunit | Cholangiocarcinoma (100%) | Prostate Cancer, NOS (7.69%) |
| TNK2 | Activated CDC42 kinase 1 | Myelodysplasia (5.56%) | Prostate Cancer, NOS (21.54%) |
| UBC | Polyubiquitin-C | Endometrial Cancer (2%) | Prostate Cancer, NOS (12.31%) |
| PIK3R2 | Phosphatidylinositol 3-kinase regulatory subunit beta | Small Bowel Cancer (5.56%) | Prostate Cancer, NOS (13.85%) |
| IDE | Insulin-degrading enzyme | Endometrial Cancer (3.78%) | Prostate Cancer, NOS (7.69%) |
| PSMB3 | Proteasome subunit beta type-3 | Adrenocortical Carcinoma (0.99%) | Breast Cancer, NOS (18.75%) |
| GRM7 | Metabotropic glutamate receptor 7 | Ovarian/Fallopian Tube Cancer, NOS (14.29%) | Prostate Cancer, NOS (23.08%) |
| THRA | Thyroid hormone receptor alpha | Colorectal Adenocarcinoma (2.91%) | Breast Cancer, NOS (18.75%) |
| MED1 | Mediator of RNA polymerase II transcription subunit 1 | Cervical Cancer (4.6%) | Breast Cancer, NOS (31.25%) |
| THRB | Thyroid hormone receptor beta | Cutaneous Melanoma (5.23%) | Prostate Cancer, NOS (21.54%) |
| NCS1 | Neuronal calcium sensor 1 | Endometrial Cancer (0.59%) | Prostate Cancer, NOS (13.85%) |
| NR3C2 | Mineralocorticoid receptor | Ovarian/Fallopian Tube Cancer, NOS (14.29%) | Prostate Cancer, NOS (15.38%) |
| TUB | Tubby protein homolog | Endometrial Cancer (4.08%) | Prostate Cancer, NOS (9.23%) |
| IL2 | Interleukin-2 | Cutaneous Melanoma (1.05%) | Prostate Cancer, NOS (7.69%) |

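Taking NR5A1 (Steroidogenic Factor 1) as an example, its first-order neighbors in the PPI network were extracted to build a subnetwork, shown in Figure 4.

Figure 4   First-order neighbor subnetwork of NR5A1

NR5A1 is the seed node of this subnetwork, with Degree=8. Five of its neighbors (AR, JUN, NCOA1, MAPK1, and NFKB1) are antineoplastic drug targets. The mutation and amplification status of these five genes and of NR5A1 in different cancer types was obtained from cBioPortal, as shown in Figures 5 to 10.

Figure 5   Expression of NR5A1 in different cancer types

Figure 6   Expression of NFKB1 in different cancer types

Figure 7   Expression of NCOA1 in different cancer types

Figure 8   Expression of MAPK1 in different cancer types

Figure 10   Expression of JUN in different cancer types

Figure 5 shows that NR5A1 has relatively high mutation rates in five cancers: melanoma, adrenocortical carcinoma, endometrial cancer, esophagogastric cancer, and colorectal adenocarcinoma. Although NR5A1 has no mutations in prostate cancer, its amplification rate there (that is, the rate of increased gene copy number) reaches 16.92%.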

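Building on Figure 5, Figures 6 to 10 show that the five genes AR, JUN, NCOA1, MAPK1, and NFKB1 are mutated in the six cancer types considered; the mutation frequencies of each gene are summarized in Table 5. AR has a higher mutation rate than the other genes in four of the cancers. Overall, however, in the cancers where NR5A1 is frequently mutated, the mutation rates of all five genes also rank near the top of their own mutation-rate lists. Besides the five cancer types mentioned above, all five genes are also mutated in cancer of unknown primary, where the amplification of NR5A1 reaches 1.03%, which indirectly reflects a close relationship between NR5A1 and the onset and progression of this cancer.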

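Table 5   Mutation frequencies of the five antineoplastic drug targets in different cancer tissues

| Gene | Melanoma | Adrenocortical Carcinoma | Endometrial Cancer | Esophagogastric Cancer | Colorectal Adenocarcinoma | Cancer of Unknown Primary |
| --- | --- | --- | --- | --- | --- | --- |
| AR | 2.09% (2.79%^a, 18^b) | 1.97% (1.97%, 20) | 6.08% (6.68%, 4) | 4.09% (4.75%, 11) | 4.52% (4.52%, 9) | 5.14% (5.24%, 7) |
| NCOA1 | 2.79% (3.83%, 6) | 1.97% (2.46%, 9) | 5.34% (7.42%, 2) | 2.47% (3.20%, 7) | 3.55% (3.55%, 4) | 4.40% (5.99%, 3) |
| JUN | 0.35% (1.05%, 15) | 0 (0.99, -) | 0.59% (0.96%, 8) | 0.76% (1.15%, 7) | 1.94% (1.94%, 2) | 0.47% (3.27%, 11) |
| MAPK1 | 1.39% (3.48%, 4) | 0.99% (2.96%, 8) | 1.19% (2.23%, 6) | 0.49% (1.55%, 16) | 0.65% (0.96%, 14) | 0.84% (6.08%, 9) |
| NFKB1 | 2.09% (2.79%, 4) | 0.99% (0.99%, 8) | 3.86% (4.15%, 2) | 0.73% (0.89%, 13) | 2.91% (2.91%, 3) | 6.74% (8.23%, 1) |

(Note: a: frequency of mutation plus amplification of the gene in the cancer tissue; b: rank of the gene's mutation frequency in that tissue.)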

As with NR5A1, the amplification rates of its first-order neighbors AR, NCOA1, JUN, MAPK1, and NFKB1 are also markedly elevated in prostate cancer, as shown in Table 6.

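Table 6   Amplification rates of the six genes in prostate cancer

| Gene | Cases | Amplification cases | Proportion |
| --- | --- | --- | --- |
| AR | 65 | 38 | 58.46% |
| NR5A1 | 65 | 11 | 16.92% |
| NCOA1 | 65 | 8 | 12.31% |
| JUN | 65 | 6 | 9.23% |
| MAPK1 | 65 | 5 | 7.69% |
| NFKB1 | 65 | 4 | 6.15% |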
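The proportions in Table 6 are the amplified-case counts divided by the 65 prostate cancer cases. A short check, using only the counts listed in the table (the variable names are illustrative):

```python
# Amplification counts in prostate cancer as listed in Table 6 (65 cases each).
amplified = {"AR": 38, "NR5A1": 11, "NCOA1": 8, "JUN": 6, "MAPK1": 5, "NFKB1": 4}
cases = 65

# Percentage of cases with amplification, rounded to two decimals.
rates = {gene: round(100 * n / cases, 2) for gene, n in amplified.items()}
```

For example, 38/65 gives 58.46% for AR and 5/65 gives 7.69% for MAPK1.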

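Previous research has shown[22] that changes in gene copy number may be key to the onset and progression of cancer. The genes above therefore appear to be associated with the onset and progression of prostate cancer and deserve attention in related research.

The similarity between NR5A1 and the antineoplastic drug targets among its first-order neighbors, in terms of mutation and amplification across cancer tissues, suggests that NR5A1 may be a new potential antineoplastic drug target. It also indirectly indicates that the topological properties of the PPI network reflect functional similarity between genes well and can be used in drug target prediction.

5 Conclusion

Taking antineoplastic drugs as an example, this study used machine learning to build a prediction model for antineoplastic drug targets from the topological properties of the PPI network, reaching an accuracy of 73.18%. The misclassified negative samples were validated according to their prediction scores, which further supports the hypothesis that targets with similar network properties also have similar functions. Future work will examine the structural, sequence, and functional features of antineoplastic drug targets in more depth, to obtain more accurate results and provide a reference for clinical work and experimental validation.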

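Author Contributions

Cui Lei: proposed the research idea and revised the final version of the paper;

Fan Xinyue: designed the research plan, carried out the experiments, collected, cleaned, and analyzed the data, and drafted the paper.

Conflict of Interest

The authors declare that they have no conflict of interest.

Supporting Data

The supporting data are stored by the authors, E-mail: 875763928@qq.com.

[1] Fan Xinyue. all_type_target_interaction_network.xlsx. PPI interaction network data set.

[2] Fan Xinyue. all_topological_future_class.csv. Network topological properties of the targets.

[3] Fan Xinyue. trainset.csv. test.csv. Balancetrain set.csv. Training set, test set, and balanced training set.

[4] Fan Xinyue. Result.png. Images of the validation results.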


References

[1] Allemani C, Matsuda T, Di Carlo V, et al. Global Surveillance of Trends in Cancer Survival 2000-14 (CONCORD-3): Analysis of Individual Records for 37513025 Patients Diagnosed with One of 18 Cancers from 322 Population-based Registries in 71 Countries[J]. The Lancet, 2018, 391(10125): 1023-1075. https://doi.org/10.1016/S0140-6736(17)33326-3 PMID: 29395269
[2] Chen Wanqing, Sun Kexin, Zheng Rongshou, et al. Report of Cancer Incidence and Mortality in Different Areas of China, 2014[J]. China Cancer, 2018, 27(1): 1-14. (in Chinese)
[3] Futreal P A, Coin L, Marshall M, et al. A Census of Human Cancer Genes[J]. Nature Reviews Cancer, 2004, 4(3): 177-183. https://doi.org/10.1038/nrc1299 PMID: 14993899
[4] Strausberg R L, Simpson A J, Wooster R. Sequence-based Cancer Genomics: Progress, Lessons and Opportunities[J]. Nature Reviews Genetics, 2003, 4(6): 409-418. https://doi.org/10.1038/nrg1085 PMID: 12776211
[5] Ostlund G, Lindskog M, Sonnhammer E L. Network-based Identification of Novel Cancer Genes[J]. Molecular & Cellular Proteomics, 2010, 9(4): 648-655. https://doi.org/10.1074/mcp.M900227-MCP200 PMID: 2860235
[6] Li L, Zhang K, Lee J, et al. Discovering Cancer Genes by Integrating Network and Functional Properties[J]. BMC Medical Genomics, 2009, 2: 61-74. https://doi.org/10.1186/1755-8794-2-61 PMID: 2758898
[7] Shang Zhenwei, Li Jin, Jiang Yongshuai, et al. A Method of Drug Target Prediction Based on SVM and Its Application[J]. Progress in Modern Biomedicine, 2012, 12(20): 3943-3946. https://doi.org/10.3969/j.issn.1004-1346.2014.08.015 (in Chinese)
[8] Xie Qianqian, Li Dingfang, Zhang Wen. Predicting Potential Drug Targets for Ion Channel Proteins Based on Ensemble Learning[J]. Computer Science, 2015, 42(4): 177-180. https://doi.org/10.11896/j.issn.1002-137X.2015.4.035 (in Chinese)
[9] Cai Lige. Research on the Prediction of Drug Targets Based on Imbalance Data Mining[D]. Harbin: Harbin University of Science and Technology, 2017. (in Chinese)
[10] Carson M B, Lu H. Network-based Prediction and Knowledge Mining of Disease Genes[J]. BMC Medical Genomics, 2015, 8(S2): S9. https://doi.org/10.1186/1755-8794-8-S2-S9 PMID: 4460923
[11] Jing Y, Bian Y, Hu Z, et al. Deep Learning for Drug Design: An Artificial Intelligence Paradigm for Drug Discovery in the Big Data Era[J]. The AAPS Journal, 2018, 20(3): 58. https://doi.org/10.1208/s12248-018-0210-0 PMID: 29943256
[12] Ferrero E, Dunham I, Sanseau P. In Silico Prediction of Novel Therapeutic Targets Using Gene-Disease Association Data[J]. Journal of Translational Medicine, 2017, 15(1): 182. https://doi.org/10.1186/s12967-017-1285-6 PMID: 28851378
[13] Wishart D S, Knox C, Guo A C, et al.

DrugBank: A Knowledgebase for Drugs, Drug Actions and Drug Targets

[J]. Nucleic Acids Research, 2008, 36(Database Issue): D901-D906.

https://doi.org/10.1093/nar/gkm958      PMID: 18048412

DrugBank is a richly annotated resource that combines detailed drug data with comprehensive drug target and drug action information. Since its first release in 2006, DrugBank has been widely used to facilitate in silico drug target discovery, drug design, drug docking or screening, drug metabolism prediction, drug interaction prediction and general pharmaceutical education. The latest version of DrugBank (release 2.0) has been expanded significantly over the previous release. With approximately 4900 drug entries, it now contains 60% more FDA-approved small molecule and biotech drugs including 10% more 'experimental' drugs. Significantly, more protein target data has also been added to the database, with the latest version of DrugBank containing three times as many non-redundant protein or drug target sequences as before (1565 versus 524). Each DrugCard entry now contains more than 100 data fields with half of the information being devoted to drug/chemical data and the other half devoted to pharmacological, pharmacogenomic and molecular biological data. A number of new data fields, including food-drug interactions, drug-drug interactions and experimental ADME data have been added in response to numerous user requests. DrugBank has also significantly improved the power and simplicity of its structure query and text query searches. DrugBank is available at http://www.drugbank.ca.
[14] Keshava Prasad T S, Goel R, Kandasamy K, et al.

Human Protein Reference Database

[J]. Nucleic Acids Research, 2008, 37(S1): 767-772.

https://doi.org/10.1038/nrg1266

This resource depicts information on human protein functions including enzyme-substrate relationships and disease associations. Protein annotation information that is catalogued was derived through manual curation using published literature by expert biologists and through bioinformatics analyses of the protein sequence. The protein–protein interaction and subcellular localization data from HPRD have been used to develop a human protein interaction network.
[15] Shannon P, Markiel A, Ozier O, et al.

Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks

[J]. Genome Research, 2003, 13(11): 2498-2504.

https://doi.org/10.1101/gr.1239303

[16] Hall M, Frank E, Holmes G, et al.

The WEKA Data Mining Software: An Update

[J]. ACM SIGKDD Explorations Newsletter, 2009, 11(1): 10-18.

https://doi.org/10.1145/1656274

[17] Han L, Cui J, Lin H, et al.

Recent Progresses in the Application of Machine Learning Approach for Predicting Protein Functional Class Independent of Sequence Similarity

[J]. Proteomics, 2006, 6(14): 4023-4037.

https://doi.org/10.1002/pmic.200500938      PMID: 16791826

Protein sequence contains clues to its function. Functional prediction from sequence presents a challenge particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting functional class of proteins from sequence-derived properties independent of sequence similarity, which showed promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progresses, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented, which need to be interpreted with caution as they are dependent on such factors as datasets used and choice of parameters.
[18] Chawla N V, Bowyer K W, Hall L O, et al.

SMOTE: Synthetic Minority Over-Sampling Technique

[J]. Journal of Artificial Intelligence Research, 2002, 16(1): 321-357.

https://doi.org/10.1613/jair.953

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
[19] Du Jinglin, Yan Weilan.

Multiple Classifiers of C4.5 Decision Tree Based on Distance Weight

[J]. Computer Engineering and Design, 2018, 39(1): 96-102. (in Chinese)

When classifying multidimensional data, the C4.5 decision tree algorithm does not consider the influence of each attribute on the classification result, which leads to low classification accuracy. To address this problem, a C4.5 combined decision tree algorithm based on distance weights is proposed. Distance weights for the data attributes are defined using the standardized Euclidean distance and are used to update the information gain ratio of the C4.5 algorithm, yielding a distance-weighted C4.5 algorithm. The improved C4.5 classification algorithm is used to train multiple base classifiers, which are then combined into an ensemble decision tree through Bagging. Experimental results show that the algorithm achieves high accuracy and stability when processing multidimensional data.
[20] Huang Xiuxia, Sun Li.

Optimization of C4.5 Algorithm

[J]. Computer Engineering and Design, 2016, 37(5): 1265-1270. (in Chinese)

[21] Cerami E, Gao J, Dogrusoz U, et al.

The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data

[J]. Cancer Discovery, 2012, 2(5): 401-404.

https://doi.org/10.1158/2159-8290.CD-12-0095

[22] Delaney J R, Patel C B, Willis K M, et al.

Haploinsufficiency Networks Identify Targetable Patterns of Allelic Deficiency in Low Mutation Ovarian Cancer

[J]. Nature Communications, 2017, 8: Article No.14423.

https://doi.org/10.1038/ncomms14423      PMID: 28198375

Identification of specific oncogenic gene changes has enabled the modern generation of targeted cancer therapeutics. In high-grade serous ovarian cancer (OV), the bulk of genetic changes is not somatic point mutations, but rather somatic copy-number alterations (SCNAs). The impact of SCNAs on tumour biology remains poorly understood. Here we build haploinsufficiency network analyses to identify which SCNA patterns are most disruptive in OV. Of all KEGG pathways (N=187), autophagy is the most significantly disrupted by coincident gene deletions. Compared with 20 other cancer types, OV is most severely disrupted in autophagy and in compensatory proteostasis pathways. Network analysis prioritizes MAP1LC3B (LC3) and BECN1 as most impactful. Knockdown of LC3 and BECN1 expression confers sensitivity to cells undergoing autophagic stress independent of platinum resistance status. The results support the use of pathway network tools to evaluate how the copy-number landscape of a tumour may guide therapy.