Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 93-102    DOI: 10.11925/infotech.2096-3467.2022.0196
GNN-MTB: An Anti-Mycobacterium Drug Virtual Screening Model Based on Graph Neural Network
Gu Yaowen, Zheng Si, Yang Fengchun, Li Jiao
Institute of Medical Information, Chinese Academy of Medical Sciences / Peking Union Medical College, Beijing 100020, China
[Objective] This study constructs a virtual screening model for anti-tuberculosis drugs to support the research and development of new medicines. [Methods] We propose GNN-MTB, a graph neural network model optimized with curriculum learning for the virtual screening of anti-tuberculosis inhibitors. We built a benchmark dataset of 10,789 records for anti-tuberculosis drugs from open-access databases and, on this dataset, compared GNN-MTB with four classic machine learning models and two graph neural network models. [Results] GNN-MTB reached an AUC of 0.912 and an AUPR of 0.679, both higher than those of the compared models. Its maximum relative improvements in AUC and AUPR were 3.872% and 13.167%, respectively. GNN-MTB is open source. [Limitations] The model does not yet incorporate data on drug sensitivity and bacterial resistance. [Conclusions] GNN-MTB can support anti-tuberculosis drug screening, and the same approach could be used to build virtual screening models for other diseases.

Key words: Graph Neural Network; Curriculum Learning; Mycobacterium Tuberculosis; Virtual Screening
Received: 09 March 2022      Published: 13 January 2023
ZTFLH:  R961  
Fund: CAMS Innovation Fund for Medical Sciences (CIFMS) (2021-I2M-1-056); CAMS Innovation Fund for Medical Sciences (CIFMS) (2018-I2M-AI-016); National Key R&D Program of China (2016YFC0901901)
Gu Yaowen, Zheng Si, Yang Fengchun, Li Jiao. GNN-MTB: An Anti-Mycobacterium Drug Virtual Screening Model Based on Graph Neural Network. Data Analysis and Knowledge Discovery, 2022, 6(11): 93-102.

Process of Building Anti-Tuberculosis Drug Virtual Screening Model
Proportion of Value Types in Anti-Tuberculosis Dataset
Kernel Density Estimation Distribution of Anti-Tuberculosis MIC Values
Compound (SMILES)        Label
CC(C)c1csc(C(=O)NN)n1    0
COc1cc2ccc(=O)oc2cc1O    0
NNC(=O)c1ccncc1          1
Cc1sc(N)nc1C(=O)O        1
NC(=O)c1cnccn1           0
Diagram of Anti-Tuberculosis Dataset
Interface of GNN-MTB Based Anti-Tuberculosis Drug Virtual Screening Tool
Type                  Model    AUC           AUPR         F1 score
Machine learning      RF       0.897±0.011   0.634±0.012  0.620±0.015
Machine learning      SVM      0.894±0.008   0.647±0.022  0.624±0.010
Machine learning      MLP      0.896±0.011   0.649±0.013  0.614±0.015
Machine learning      GBDT     0.897±0.008   0.673±0.017  0.631±0.019
Graph neural network  GAT      0.900±0.013*  0.656±0.027  0.609±0.048
Graph neural network  MPNN     0.878±0.014   0.600±0.033  0.595±0.039
Graph neural network  GNN-MTB  0.912±0.010*  0.679±0.017  0.643±0.032
Comparison of Model Performance Result
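The relative improvements quoted in the abstract can be checked against the mean scores in the table above. The following is a minimal sketch, assuming that "improvement" means the relative gain of GNN-MTB over the weakest compared model for each metric; the helper names `relative_gain` and `max_gain` are illustrative and not taken from the authors' code.

```python
# Mean AUC and AUPR per model, copied from the comparison table
# (standard deviations omitted).
scores = {
    "RF": (0.897, 0.634),
    "SVM": (0.894, 0.647),
    "MLP": (0.896, 0.649),
    "GBDT": (0.897, 0.673),
    "GAT": (0.900, 0.656),
    "MPNN": (0.878, 0.600),
    "GNN-MTB": (0.912, 0.679),
}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def max_gain(metric_index, target="GNN-MTB"):
    """Return the baseline that `target` improves on the most, and that gain."""
    target_score = scores[target][metric_index]
    gains = {
        name: relative_gain(target_score, values[metric_index])
        for name, values in scores.items()
        if name != target
    }
    baseline = max(gains, key=gains.get)
    return baseline, round(gains[baseline], 3)


print("AUC :", max_gain(0))
print("AUPR:", max_gain(1))
```

Under this reading, the largest gains for both metrics are over MPNN, and they match the 3.872% and 13.167% reported in the abstract.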
The ROC Curve of Different Models
The PR Curve of Different Models
Precision Score on Top300 Results Predicted by Different Models
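Type                  Model    AUC    AUPR   F1 score
Machine learning      RF       0.541  0.306  0.248
Machine learning      SVM      0.569  0.335  0.222
Machine learning      MLP      0.641  0.370  0.354
Machine learning      GBDT     0.545  0.305  0.122
Graph neural network  GAT      0.691  0.463  0.400
Graph neural network  MPNN     0.552  0.318  0.178
Graph neural network  GNN-MTB  0.683  0.462  0.526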
Model Performance Result (External Validation Set)
SMILES    True activity value    Label    Predicted probability
Cc1ccn2nc(C)c(C(=O)NCc3ccc(N4CCC(c5ccc(OC(F)(F)F)cc5)CC4)cc3)c2n1 57.340 0 0.912
Cc1cn2nc(C)c(C(=O)NCc3ccc(N4CCC(c5ccc(OC(F)(F)F)cc5)CC4)cc3)c2s1 0.060 1 0.866
Cc1nn2ccsc2c1C(=O)NCc1ccc(N2CCC(c3ccc(OC(F)(F)F)cc3)CC2)cc1 0.580 1 0.821
Cc1ccc2[nH]c(C)c(C(=O)NCc3ccc(N4CCC(c5ccc(OC(F)(F)F)cc5)CC4)cc3)c2c1 19.190 0 0.771
Cc1nn2c(C)csc2c1C(=O)NCc1ccc(N2CCC(c3ccc(OC(F)(F)F)cc3)CC2)cc1 0.190 1 0.699
COc1nc2ccc(Br)cc2cc1[C@@H](c1ccccc1)[C@@](O)(CCN(C)C)c1cccc2ccccc12 0.450 1 0.682
Cc1c(-c2ccc(N3CCC(C(F)(F)F)CC3)cc2)[nH]c2cc(F)cc(F)c2c1=O 10.000 0 0.667
COc1ccc(CN2CC3(CCN(c4nc(=O)c5cc(C(F)(F)F)cc([N+](=O)[O-])c5s4)CC3)C2)cc1 0.110 1 0.650
COc1ccc(CN2CCC3(CCN(c4nc(=O)c5cc(C(F)(F)F)cc([N+](=O)[O-])c5s4)CC3)C2)cc1 0.110 1 0.648
O=c1nc(N2CCN(CC3CCCCC3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.040 1 0.631
COc1ccc(COc2nc3ccc(Br)cc3cc2-c2ccc(CN(C)C)cc2)cc1 10.900 0 0.625
O=c1nc(N2CCC3(CCN(Cc4ccc(C(F)(F)F)cc4)CC3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.040 1 0.622
COc1ccc(CN2CCC3(CC2)CCN(c2nc(=O)c4cc(C(F)(F)F)cc([N+](=O)[O-])c4s2)CC3)cc1 1.010 0 0.620
COc1cccc(COc2nc3ccc(Br)cc3cc2-c2ccc(CN(C)C)cc2)c1 1.200 0 0.619
CN(C)Cc1ccc(-c2cc3cc(Br)ccc3nc2OCc2ccncc2)cc1 0.950 1 0.617
O=c1nc(N2CCC3(CCN(Cc4ccc(Br)cc4)CC3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.060 1 0.613
CN(C)Cc1ccc(-c2cc3cc(Br)ccc3nc2OCc2ccc(Cl)cc2)cc1 1.100 0 0.600
Cc1ccc2oc(C)c(C(=O)NCc3ccc(N4CCC(c5ccc(OC(F)(F)F)cc5)CC4)cc3)c2c1 57.450 0 0.587
CCOC(=O)CN1CC2(CCN(c3nc(=O)c4cc(C(F)(F)F)cc([N+](=O)[O-])c4s3)CC2)C1 0.440 1 0.586
COc1cc(CN2CCC3(CC2)CCN(c2nc(=O)c4cc(C(F)(F)F)cc([N+](=O)[O-])c4s2)CC3)cc(OC)c1 0.820 1 0.575
O=c1nc(N2CCC3(CC2)CN(CC2CCCCC2)C3)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.470 1 0.574
CN(C)Cc1ccc(-c2cc3cc(Br)ccc3nc2OCc2c(F)cccc2F)cc1 3.100 0 0.570
O=c1nc(N2CCC3(CCN(Cc4ccc(F)cc4)C3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 1.680 0 0.566
CN(C)c1cc2[nH]c(C3CCC(F)(F)CC3)nc2cc1NC(=O)c1ccc(OC(F)(F)F)cc1 2.590 0 0.564
CN(C)Cc1ccc(-c2cc3cc(Br)ccc3nc2OCc2cccc(F)c2)cc1 1.500 0 0.564
CCOC(=O)CN1CCC2(CCN(c3nc(=O)c4cc(C(F)(F)F)cc([N+](=O)[O-])c4s3)CC2)C1 0.060 1 0.564
CN1CCC2(CC1)CCN(c1nc(=O)c3cc(C(F)(F)F)cc([N+](=O)[O-])c3s1)CC2 33.540 0 0.560
O=c1nc(N2CCC3(CCN(Cc4ccccc4)C3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.060 1 0.559
O=c1nc(N2CCC3(CCN(CC4CCCCC4)C3)CC2)sc2c([N+](=O)[O-])cc(C(F)(F)F)cc12 0.220 1 0.558
COc1nc2ccc(Br)cc2cc1-c1ccc(CN(C)C)cc1 15.900 0 0.546
Top30 Prediction Results of GNN-MTB Model on External Validation Set
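The ranked list above can be summarized as precision at k, the fraction of true actives (label 1) among the k highest-scored compounds. The sketch below uses only the label column of the 30 rows, in rank order; the function name `precision_at_k` is illustrative, and this is not necessarily how the authors computed their Top300 precision.

```python
# Labels of the 30 top-ranked compounds, in descending order of
# predicted probability, copied from the table above.
labels_by_rank = [
    0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
    0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
    1, 0, 0, 0, 0, 1, 0, 1, 1, 0,
]


def precision_at_k(labels, k):
    """Fraction of positives among the first k ranked labels."""
    if not 0 < k <= len(labels):
        raise ValueError("k must be between 1 and len(labels)")
    return sum(labels[:k]) / k


for k in (10, 20, 30):
    print(f"precision@{k} = {precision_at_k(labels_by_rank, k):.3f}")
```

For these rows, 7 of the top 10 and 16 of the top 30 compounds are labeled active.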
True Activity Values on External Validation Set
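In the Top30 table, every compound with a true activity value below 1 carries label 1 and every compound with a value above 1 carries label 0. The sketch below checks that a cutoff of 1.0 reproduces those labels. The cutoff is an assumption inferred from these 30 rows only; this section does not state the threshold or the unit the authors used, and `ASSUMED_CUTOFF` and `label_from_activity` are hypothetical names.

```python
# (true activity value, label) pairs copied from the Top30 table, in rank order.
rows = [
    (57.340, 0), (0.060, 1), (0.580, 1), (19.190, 0), (0.190, 1),
    (0.450, 1), (10.000, 0), (0.110, 1), (0.110, 1), (0.040, 1),
    (10.900, 0), (0.040, 1), (1.010, 0), (1.200, 0), (0.950, 1),
    (0.060, 1), (1.100, 0), (57.450, 0), (0.440, 1), (0.820, 1),
    (0.470, 1), (3.100, 0), (1.680, 0), (2.590, 0), (1.500, 0),
    (0.060, 1), (33.540, 0), (0.060, 1), (0.220, 1), (15.900, 0),
]

# Assumption: inferred from the rows above, not stated in the source.
ASSUMED_CUTOFF = 1.0


def label_from_activity(value, cutoff=ASSUMED_CUTOFF):
    """Label a compound active (1) when its activity value is below the cutoff."""
    return 1 if value < cutoff else 0


# Rows whose listed label disagrees with the assumed rule.
mismatches = [(v, y) for v, y in rows if label_from_activity(v) != y]
print("rows checked:", len(rows), "mismatches:", len(mismatches))
```

No value in these rows equals the cutoff exactly, so the sample cannot show whether the boundary itself would count as active.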
