Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 133-144    DOI: 10.11925/infotech.2096-3467.2022.0603
Ten-Year Prediction of Coronary Heart Disease Based on PCHD-TabNet
Jiang Linfu¹, Yuan Zhenming¹˒², Zhang Xingwei³, Jiang Huaqiang², Sun Xiaoyan¹˒²
¹ School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
² Mobile Health Management System Engineering Research Center of the Ministry of Education, Hangzhou 311121, China
³ The Affiliated Hospital of Hangzhou Normal University, Hangzhou 310015, China
Abstract  

[Objective] This paper aims to accurately predict the risk of coronary heart disease and to analyze the importance of its risk factors, helping doctors intervene in time and support patients in prevention and treatment. [Methods] We propose PCHD-TabNet, a coronary heart disease prediction framework based on an attentive interpretable tabular learning neural network (TabNet). Self-supervised learning is used to accelerate convergence and keep the model stable. [Results] PCHD-TabNet outperformed the other models overall, and its AUC on the dataset reached 0.72. [Limitations] The Framingham dataset consists of routine physical examination data; with better clinical data, predictive performance might improve further. [Conclusions] Comparative experiments show that the proposed method improves model performance and outperforms the traditional models. This study provides an efficient method for coronary heart disease prediction and can serve as a reference for similar data mining tasks.

Key words: Coronary Heart Disease Prediction; TabNet; Machine Learning
Received: 12 June 2022      Published: 09 November 2022
ZTFLH: TP399; G350
Fund: Hangzhou Agricultural and Social Development Research Initiative Design Project (20190101A03)
Corresponding Author: Sun Xiaoyan, ORCID: 0000-0002-8781-5303, E-mail: sunxy@hznu.edu.cn

Cite this article:

Jiang Linfu, Yuan Zhenming, Zhang Xingwei, Jiang Huaqiang, Sun Xiaoyan. Ten-Year Prediction of Coronary Heart Disease Based on PCHD-TabNet. Data Analysis and Knowledge Discovery, 2023, 7(5): 133-144.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0603     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I5/133

The Overview Flowchart of PCHD-TabNet
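Field | Description | Values / Range
male | Participant sex | 1 = male, 2 = female
education | Participant education level | 0 = some high school, 1 = high school or equivalent, 2 = some college or vocational school, 3 = college
BPMeds | Use of antihypertensive medication at examination | 0 = not currently used, 1 = currently used
currentSmoke | Current smoking status | 0 = not a current smoker, 1 = current smoker
diabetes | Diabetic by the criteria of the first examination: treated, or casual glucose of 200 mg/dL or more at the first examination | 0 = not diabetic, 1 = diabetic
prevalentStroke | Prevalent stroke | 0 = free of disease, 1 = prevalent disease
prevalentHyp | Prevalent hypertension: treated, or mean systolic pressure ≥140 mmHg or mean diastolic pressure ≥90 mmHg at the second examination | 0 = free of disease, 1 = prevalent disease
TenYearCHD | Coronary heart disease, defined as pre-existing angina pectoris, myocardial infarction, or coronary insufficiency | 0 = free of disease, 1 = prevalent disease
cigsPerDay | Number of cigarettes smoked per day | 0 = not a current smoker; 1-90 cigarettes per day
totChol | Serum total cholesterol (mg/dL) | [107, 696]
BMI | Body mass index (weight in kg / height squared) | [15.54, 56.8]
glucose | Casual glucose (mg/dL) | [40, 394.0]
heartRate | Heart rate (beats/min) | [44, 143]
age | Participant age | [32, 70]
sysBP | Systolic blood pressure, mean of the last two of three measurements (mmHg) | [83.5, 295]
diaBP | Diastolic blood pressure, mean of the last two of three measurements (mmHg) | [48, 142.5]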
Fields Contained in the Dataset
FDI Missing Value Distribution
Analysis of FDI Outliers
Positive and Negative Sample Distribution
TabNet Encoder and Decoder
Feature | FDI, no CHD (mean ± SD) | FDI, CHD (mean ± SD) | FDII, no CHD (mean ± SD) | FDII, CHD (mean ± SD)
totChol (mg/dL) | 235.17±43.52 | 245.27±47.75 | 239.8±43.14 | 251.26±46.89
age (years) | 48.76±8.41 | 54.15±8.01 | 53.01±9.07 | 58.27±8.79
sysBP (mmHg) | 130.34±20.45 | 143.62±26.69 | 133.00±20.83 | 145.67±24.49
diaBP (mmHg) | 82.17±11.34 | 86.98±14.03 | 82.24±11.08 | 86.48±12.53
cigsPerDay (cigarettes per day) | 8.72±11.65 | 10.62±12.99 | 8.21±11.99 | 9.41±13.09
BMI (weight in kg / height squared) | 25.67±3.98 | 26.52±4.49 | 25.72±3.96 | 26.78±4.38
heartRate (beats/min) | 75.76±11.99 | 76.53±12.21 | 76.49±12.22 | 77.24±12.75
glucose (mg/dL) | 80.43±18.07 | 88.15±39.62 | 82.06±19.05 | 87.70±31.43
Statistical Distribution of Continuous Features in Dataset
Feature | Category | FDI, CHD (n = 644) | FDI, no CHD (n = 3,596) | FDII, CHD (n = 1,229) | FDII, no CHD (n = 8,665)
male |  | 301 (12.44%) | 2,119 (87.56%) | 548 (9.50%) | 5,223 (90.50%)
 |  | 343 (18.85%) | 1,477 (81.15%) | 681 (16.52%) | 3,442 (83.48%)
education | Some high school | 323 (18.78%) | 1,397 (81.22%) | 568 (14.94%) | 3,234 (85.06%)
 | High school or equivalent | 163 (12.00%) | 1,195 (88.00%) | 358 (10.94%) | 2,914 (89.06%)
 | Some college or vocational school | 88 (12.77%) | 601 (87.23%) | 150 (9.09%) | 1,501 (90.91%)
 | College | 70 (14.80%) | 403 (85.20%) | 153 (13.09%) | 1,016 (86.91%)
currentSmoke |  | 333 (15.37%) | 1,834 (84.63%) | 677 (12.22%) | 4,863 (87.78%)
 |  | 311 (15.00%) | 1,762 (85.00%) | 552 (12.68%) | 3,802 (87.32%)
prevalentStroke |  | 633 (15.02%) | 3,582 (84.98%) | 1,200 (12.23%) | 8,611 (87.77%)
 |  | 11 (44.00%) | 14 (56.00%) | 29 (34.94%) | 54 (65.06%)
prevalentHyp |  | 319 (10.91%) | 2,604 (89.09%) | 434 (7.60%) | 5,279 (92.40%)
 |  | 325 (24.68%) | 992 (75.32%) | 795 (19.01%) | 3,386 (80.99%)
diabetes |  | 604 (14.62%) | 3,527 (85.38%) | 1,112 (11.66%) | 8,426 (88.34%)
 |  | 40 (36.70%) | 69 (63.30%) | 117 (32.87%) | 239 (67.13%)
BPMeds |  | 603 (14.65%) | 3,513 (85.35%) | 1,056 (11.47%) | 8,150 (88.53%)
 |  | 41 (33.06%) | 83 (66.94%) | 173 (25.15%) | 515 (74.85%)
Statistical Distribution of Discrete Features in Dataset
Correlation of Each Feature with the Level of 10-Year CHD Risk
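Hyperparameter | Description | Value / Search Range
N_d | Width of the decision prediction layer | {8, 16, 32, 64}
N_a | Width of the attention embedding for each mask | {8, 16, 32, 64}
N_steps | Number of steps in the architecture | 3 to 10
lr | Learning rate | 0.02
batch_size | Number of samples per batch | 20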
The Hyperparameter Settings of PCHD-TabNet
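The search ranges in the hyperparameter table can be enumerated programmatically. The following is a minimal sketch, assuming an exhaustive grid over N_d, N_a, and N_steps with lr and batch_size held fixed; the page does not say how the ranges were actually searched, and the function name `candidate_configs` and the lowercase dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# Search ranges as listed in the hyperparameter table.
N_D = [8, 16, 32, 64]          # width of the decision prediction layer
N_A = [8, 16, 32, 64]          # width of the attention embedding for each mask
N_STEPS = list(range(3, 11))   # number of steps, 3 to 10 inclusive

# Fixed settings from the same table.
FIXED = {"lr": 0.02, "batch_size": 20}


def candidate_configs():
    """Yield one dict per combination of the searched hyperparameters.

    Assumption: a full grid search; the source does not state the strategy.
    """
    for n_d, n_a, n_steps in product(N_D, N_A, N_STEPS):
        yield {"n_d": n_d, "n_a": n_a, "n_steps": n_steps, **FIXED}


configs = list(candidate_configs())
print(len(configs))   # 4 * 4 * 8 = 128 candidate configurations
print(configs[0])
```

Each dictionary could then be passed to whatever training routine is in use; a random or Bayesian search over the same ranges would be a reasonable alternative when 128 training runs are too costly.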
Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC
DT | 74.59 | 18.71 | 16.16 | 22.22 | 0.53
Bagging | 64.81 | 25.53 | 17.69 | 45.83 | 0.60
XGBoost | 82.45 | 15.79 | 21.43 | 12.50 | 0.58
RF | 83.00 | 13.08 | 20.00 | 9.72 | 0.63
LightGBM | 83.27 | 13.27 | 20.90 | 9.72 | 0.58
PCHD-TabNet | 60.05 | 31.40 | 20.28 | 69.44 | 0.67
FDI Experimental Results
Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC
DT | 77.91 | 20.46 | 16.06 | 28.17 | 0.56
Bagging | 60.06 | 21.17 | 13.21 | 53.17 | 0.60
XGBoost | 87.96 | 17.53 | 28.32 | 12.70 | 0.66
RF | 87.84 | 22.05 | 31.16 | 17.06 | 0.69
LightGBM | 87.19 | 19.19 | 26.39 | 15.08 | 0.65
PCHD-TabNet | 70.91 | 30.16 | 19.90 | 62.30 | 0.72
FDII Experimental Results
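The F1 values in the FDI and FDII result tables are the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes F1 for the PCHD-TabNet rows from the reported precision and recall as a consistency check; the helper name `f1_score` is a hypothetical name for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) for PCHD-TabNet, copied from the tables.
reported = {
    "FDI": (20.28, 69.44, 31.40),
    "FDII": (19.90, 62.30, 30.16),
}

for dataset, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{dataset}: recomputed F1 = {f1:.2f}%, reported = {f1_reported:.2f}%")
```

The recomputed values agree with the reported ones to within rounding. The same formula shows why the tree ensembles score low on F1 despite accuracy above 80%: their recall is very low, so the harmonic mean is pulled down.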
Comparison of AUC Curves
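Study | Model | Dataset | Data Balancing | Feature Selection | Accuracy/% | F1/% | Precision/% | Recall/% | AUC
Rajliwall et al.[24] | SVM | FDI | - | All features | 90.2 | - | - | - | -
Pe et al.[9] | RF | FDI | Under-sampling | Automatic feature selection | 84.8 | - | - | - | -
Krishnani[2] | RF | FDI | Over-sampling | Automatic feature selection | 96.8 | - | - | - | -
Elsayed et al.[7] | KNN | FDI | - | Feature importance | 66.7 | - | - | - | -
Dogan et al.[26] | Voting Ensemble Classifier | Framingham DNA methylation data | - | - | 78 | - | 75 | 78 | -
Kuruvilla et al.[25] | MLP | FDI | Over-sampling | - | 84.9 | 79.8 | 79 | 84.9 | 0.67
This paper | PCHD-TabNet | FDI | SMOTE | All features | 60.05 | 31.4 | 20.28 | 69.44 | 0.67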
FDI Related Experimental Results
Feature Importance Masks Mask[i] for 10-Year CHD Prediction (Indicating Which Features Are Selected at Step i)
Global Importance of Each Feature for 10-Year Risk Prediction
Classification Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC
RF (data balancing first) | 91.56 | 91.64 | 92.71 | 90.61 | 0.9158
RF (data splitting first) | 83.00 | 13.08 | 20.00 | 9.72 | 0.6312
Results of Data Balance and Data Division First
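The gap between the two rows above (AUC 0.9158 versus 0.6312) is consistent with information leaking into the test set when balancing is done before the split, since resampled minority records can then end up on both sides. The sketch below illustrates the split-first order under stated assumptions: plain random duplication of minority rows stands in for SMOTE, and the function name, the 20% test ratio, and the toy records are all hypothetical.

```python
import random


def split_then_oversample(rows, label_key="TenYearCHD", test_ratio=0.2, seed=0):
    """Hold out a test set first, then balance only the training set.

    Assumption: random duplication of minority rows replaces SMOTE here,
    purely to keep the example dependency-free.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_ratio)
    test, train = shuffled[:n_test], shuffled[n_test:]

    positives = [r for r in train if r[label_key] == 1]
    negatives = [r for r in train if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)

    # Duplicate minority rows until both classes have the same size.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train + extra, test


# Toy records with the FDI class counts (644 positive, 3,596 negative).
rows = [{"id": i, "TenYearCHD": 1 if i < 644 else 0} for i in range(4240)]
train, test = split_then_oversample(rows)

train_ids = {r["id"] for r in train}
test_ids = {r["id"] for r in test}
print(len(train_ids & test_ids))  # 0: no test row leaks into training
```

With this order the test set keeps the original class ratio, so the reported metrics reflect performance on realistically imbalanced data.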
Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC
KNN | 85.56 | 83.12 | 82.05 | 84.21 | 0.91
DT | 76.67 | 76.40 | 66.67 | 89.47 | 0.78
Logistic | 85.56 | 83.54 | 80.49 | 86.84 | 0.93
Bagging | 70.00 | 64.94 | 64.10 | 65.79 | 0.73
XGBoost | 85.56 | 83.54 | 80.49 | 86.84 | 0.93
RF | 84.44 | 82.05 | 80.00 | 84.21 | 0.94
GBDT | 85.56 | 82.67 | 83.78 | 81.58 | 0.90
LightGBM | 83.33 | 80.52 | 79.49 | 81.58 | 0.93
PCHD-TabNet | 89.42 | 89.71 | 87.27 | 92.31 | 0.95
Test Results of Cleveland Dataset
[1] Narain R, Saxena S, Goyal A K. Cardiovascular Risk Prediction: A Comparative Study of Framingham and Quantum Neural Network Based Approach[J]. Patient Preference and Adherence, 2016, 10: 1259-1270.
doi: 10.2147/PPA.S108203 pmid: 27486312
[2] Krishnani D, Kumari A, Dewangan A, et al. Prediction of Coronary Heart Disease Using Supervised Machine Learning Algorithms[C]// Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019: 367-372.
[3] The Joint Task Force for Guideline on the Assessment and Management of Cardiovascular Risk in China. Guideline on the Assessment and Management of Cardiovascular Risk in China[J]. Chinese Journal of Health Management, 2019, 13(1): 7-29. (in Chinese)
[4] Ma Jingyi, Liu Xiangtong, Lv Shiyun, et al. A Model of 7-Year Risk Assessment and Prediction for Coronary Heart Disease in Adults in Beijing[J]. Journal of Cardiovascular and Pulmonary Diseases, 2022, 41(1): 25-30. (in Chinese)
[5] Terrada O, Hamida S, Cherradi B, et al. Supervised Machine Learning Based Medical Diagnosis Support System for Prediction of Patients with Heart Disease[J]. Advances in Science, Technology and Engineering Systems Journal, 2020, 5(5): 269-277.
[6] Verma L, Srivastava S, Negi P C. A Hybrid Data Mining Model to Predict Coronary Artery Disease Cases Using Non-invasive Clinical Data[J]. Journal of Medical Systems, 2016, 40(7): 178.
doi: 10.1007/s10916-016-0536-z pmid: 27286983
[7] Elsayed H A G, Syed L. An Automatic Early Risk Classification of Hard Coronary Heart Diseases Using Framingham Scoring Model[C]// Proceedings of the 2nd International Conference on Internet of Things, Data and Cloud Computing. 2017: 1-8.
[8] Detrano R, Janosi A, Steinbrunn W, et al. International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease[J]. The American Journal of Cardiology, 1989, 64(5): 304-310.
doi: 10.1016/0002-9149(89)90524-9
[9] Pe R, Subasini C A, Katharine A V, et al. A Cardiovascular Disease Prediction Using Machine Learning Algorithms[J]. Annals of the Romanian Society for Cell Biology, 2021, 25(2): 904-912.
[10] Zhang Hui, Ma Haie, He Xiaodan. Predictive Value of Serum Ghrelin Level on the Risk of Coronary Heart Disease[J]. Laboratory Medicine and Clinic, 2022, 19(9): 1279-1282. (in Chinese)
[11] Song Wanmei, Liu Xifeng, Ma Xiaofeng. Application Value of Triglyceride-Glucose Index and Index of Triglyceride-Glucose-Body Mass Index to Prediction and Review of Coronary Heart Disease[J]. Chinese Journal of Evidence-Based Cardiovascular Medicine, 2021, 13(6): 680-683. (in Chinese)
[12] Kukar M, Kononenko I, Grošelj C, et al. Analysing and Improving the Diagnosis of Ischaemic Heart Disease with Machine Learning[J]. Artificial Intelligence in Medicine, 1999, 16(1): 25-50.
pmid: 10225345
[13] Pranatha M D A, Pramaita N, Sudarma M, et al. Filtering Outlier Data Using Box Whisker Plot Method for Fuzzy Time Series Rainfall Forecasting[C]// Proceedings of the 4th International Conference on Wireless and Telematics. 2018: 1-4.
[14] Chen T Q, Guestrin C. XGBoost: A Scalable Tree Boosting System[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794.
[15] Ke G L, Meng Q, Finley T, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 3149-3157.
[16] Arik S Ö, Pfister T. TabNet: Attentive Interpretable Tabular Learning[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(8): 6679-6687.
[17] He K M, Zhang X Y, Ren S Q, et al. Deep Residual Learning for Image Recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[18] Conneau A, Schwenk H, Barrault L, et al. Very Deep Convolutional Networks for Text Classification[OL]. arXiv Preprint, arXiv:1606.01781.
[19] Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin[C]// Proceedings of the 33rd International Conference on Machine Learning. 2016: 173-182.
[20] Polzlbauer G, Lidy T, Rauber A. Decision Manifolds—A Supervised Learning Algorithm Based on Self-organization[J]. IEEE Transactions on Neural Networks, 2008, 19(9): 1518-1530.
doi: 10.1109/TNN.2008.2000449 pmid: 18779085
[21] Dietterich T G. Machine Learning Research: Four Current Directions[J]. Artificial Intelligence Magazine, 1997, 18(4): 97-136.
[22] Friedl M A, Brodley C E. Decision Tree Classification of Land Cover from Remotely Sensed Data[J]. Remote Sensing of Environment, 1997, 61(3): 399-409.
doi: 10.1016/S0034-4257(97)00049-7
[23] Song S, Warren J, Riddle P. Developing High Risk Clusters for Chronic Disease Events with Classification Association Rule Mining[C]// Proceedings of the 7th Australasian Workshop on Health Informatics and Knowledge Management-Volume 153. 2014: 69-78.
[24] Rajliwall N S, Davey R, Chetty G. Machine Learning Based Models for Cardiovascular Risk Prediction[C]// Proceedings of 2018 International Conference on Machine Learning and Data Engineering. 2018: 142-148.
[25] Kuruvilla A M, Balaji N V. Heart Disease Prediction System Using Correlation Based Feature Selection with Multilayer Perceptron Approach[J]. IOP Conference Series: Materials Science and Engineering, 2021, 1085(1): 012028.
doi: 10.1088/1757-899X/1085/1/012028
[26] Dogan M V, Grumbach I M, Michaelson J J, et al. Integrated Genetic and Epigenetic Prediction of Coronary Heart Disease in the Framingham Heart Study[J]. PLoS One, 2018, 13(1): e0190549.
doi: 10.1371/journal.pone.0190549