Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (5): 133-144     https://doi.org/10.11925/infotech.2096-3467.2022.0603
Research Paper
Ten-Year Prediction of Coronary Heart Disease Based on PCHD-TabNet*
Jiang Linfu1, Yuan Zhenming1,2, Zhang Xingwei3, Jiang Huaqiang2, Sun Xiaoyan1,2
1 School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
2 Mobile Health Management System Engineering Research Center of the Ministry of Education, Hangzhou 311121, China
3 The Affiliated Hospital of Hangzhou Normal University, Hangzhou 310015, China
Abstract

[Objective] This paper aims to accurately predict an individual's risk of coronary heart disease and to analyze the relative importance of different risk factors, so that doctors can intervene in time and help patients prevent and treat the disease. [Methods] We propose a coronary heart disease prediction framework (PCHD-TabNet) based on an attentive interpretable tabular learning neural network (TabNet), and use self-supervised learning to speed up convergence and keep training stable. [Results] PCHD-TabNet performed better overall than the other models, reaching an AUC of 0.72 on the FDII dataset and 0.67 on FDI (Tables 5 and 6). [Limitations] The Framingham dataset contains only routine physical examination data; richer clinical data may further improve predictive performance. [Conclusions] Comparative experiments show that the proposed method improves model performance and outperforms traditional models. It offers an efficient approach to coronary heart disease prediction and a reference for similar data mining tasks.

Keywords: Coronary Heart Disease Prediction; TabNet; Machine Learning
Received: 2022-06-12      Published: 2022-11-09
Chinese Library Classification (ZTFLH): TP399; G350
Funding: *This work is one of the research outcomes of the Hangzhou Science and Technology Development Plan Project (20190101A03).
Corresponding author: Sun Xiaoyan, ORCID: 0000-0002-8781-5303, E-mail: sunxy@hznu.edu.cn.
Cite this article:
Jiang Linfu, Yuan Zhenming, Zhang Xingwei, Jiang Huaqiang, Sun Xiaoyan. Ten-Year Prediction of Coronary Heart Disease Based on PCHD-TabNet. Data Analysis and Knowledge Discovery, 2023, 7(5): 133-144.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0603      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I5/133
Fig.1  PCHD-TabNet flowchart
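| Field | Description | Values / range |
| --- | --- | --- |
| male | Participant sex | 1 = male, 2 = female |
| education | Participant education level | 0 = some high school, 1 = high school or equivalent, 2 = some college or vocational school, 3 = college |
| BPMeds | Use of antihypertensive medication at the examination | 0 = not currently used, 1 = currently used |
| currentSmoke | Current smoking status | 0 = not a current smoker, 1 = current smoker |
| diabetes | Diabetic according to the criteria of the first examination: treated, or casual glucose of 200 mg/dL or more at the first examination | 0 = not diabetic, 1 = diabetic |
| prevalentStroke | Prevalent stroke | 0 = no disease, 1 = disease |
| prevalentHyp | Prevalent hypertension: treated, or mean systolic BP ≥140 mmHg or mean diastolic BP ≥90 mmHg at the second examination | 0 = no disease, 1 = disease |
| TenYearCHD | Coronary heart disease, defined as pre-existing angina pectoris, myocardial infarction, or coronary insufficiency | 0 = no disease, 1 = disease |
| cigsPerDay | Number of cigarettes smoked per day | 0 = not a current smoker, 1-90 cigarettes per day |
| totChol | Serum total cholesterol (mg/dL) | [107, 696] |
| BMI | Body mass index (weight in kg / height in m squared) | [15.54, 56.8] |
| glucose | Casual glucose (mg/dL) | [40, 394.0] |
| heartRate | Heart rate (beats/min) | [44, 143] |
| age | Participant age (years) | [32, 70] |
| sysBP | Systolic blood pressure, mean of the last two of three measurements (mmHg) | [83.5, 295] |
| diaBP | Diastolic blood pressure, mean of the last two of three measurements (mmHg) | [48, 142.5] |

Table 1  Fields in the dataset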
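To make the field definitions in Table 1 concrete, here is a minimal sketch of a record validator. The ranges and codes are copied from Table 1; the helper name `validate_record`, the dictionary layout, and the spelling `prevalentStroke` (as used in Table 3) are illustrative assumptions, not part of the paper.

```python
# Closed intervals for the numeric fields, copied from Table 1.
NUMERIC_RANGES = {
    "cigsPerDay": (0, 90),
    "totChol": (107, 696),
    "BMI": (15.54, 56.8),
    "glucose": (40, 394.0),
    "heartRate": (44, 143),
    "age": (32, 70),
    "sysBP": (83.5, 295),
    "diaBP": (48, 142.5),
}

# Allowed codes for the categorical fields, as listed in Table 1.
ALLOWED_CODES = {
    "male": {1, 2},
    "education": {0, 1, 2, 3},
    "BPMeds": {0, 1},
    "currentSmoke": {0, 1},
    "diabetes": {0, 1},
    "prevalentStroke": {0, 1},
    "prevalentHyp": {0, 1},
    "TenYearCHD": {0, 1},
}


def validate_record(record):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} outside [{low}, {high}]")
    for name, codes in ALLOWED_CODES.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif value not in codes:
            problems.append(f"{name}: {value} is not one of {sorted(codes)}")
    return problems


# A made-up record used only to exercise the checks.
example = {
    "male": 1, "education": 2, "BPMeds": 0, "currentSmoke": 1,
    "diabetes": 0, "prevalentStroke": 0, "prevalentHyp": 1,
    "TenYearCHD": 0, "cigsPerDay": 20, "totChol": 240, "BMI": 26.1,
    "glucose": 85, "heartRate": 75, "age": 50, "sysBP": 135.0,
    "diaBP": 84.0,
}
```

A check of this kind would run before imputing missing values, so that gaps and out-of-range entries are reported rather than silently replaced.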
Fig.2  Distribution of missing values in the FDI data
Fig.3  Outlier analysis of FDI
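The page does not reproduce the outlier procedure behind Fig.3. Reference [13] in the list below concerns box-and-whisker filtering, so the following sketch assumes the usual box-plot rule (values beyond 1.5 times the interquartile range from the quartiles). The function name `iqr_outliers`, the factor `k=1.5`, and the sample readings are assumptions for illustration only.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" treats the data as the whole population of interest.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Made-up heart-rate readings (beats/min), not taken from FDI.
heart_rates = [60, 62, 64, 66, 68, 70, 72, 74, 143]
flagged = iqr_outliers(heart_rates)
print(flagged)
```

With these readings the quartiles are 64 and 72, so the whiskers end at 52 and 84 and only the reading of 143 is flagged.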
Fig.4  Distribution of positive and negative samples
Fig.5  TabNet encoder and decoder
| Feature | FDI, no CHD (mean ± SD) | FDI, CHD (mean ± SD) | FDII, no CHD (mean ± SD) | FDII, CHD (mean ± SD) |
| --- | --- | --- | --- | --- |
| totChol (mg/dL) | 235.17±43.52 | 245.27±47.75 | 239.8±43.14 | 251.26±46.89 |
| age (years) | 48.76±8.41 | 54.15±8.01 | 53.01±9.07 | 58.27±8.79 |
| sysBP (mmHg) | 130.34±20.45 | 143.62±26.69 | 133.00±20.83 | 145.67±24.49 |
| diaBP (mmHg) | 82.17±11.34 | 86.98±14.03 | 82.24±11.08 | 86.48±12.53 |
| cigsPerDay (cigarettes/day) | 8.72±11.65 | 10.62±12.99 | 8.21±11.99 | 9.41±13.09 |
| BMI (weight in kg / height in m squared) | 25.67±3.98 | 26.52±4.49 | 25.72±3.96 | 26.78±4.38 |
| heartRate (beats/min) | 75.76±11.99 | 76.53±12.21 | 76.49±12.22 | 77.24±12.75 |
| glucose (mg/dL) | 80.43±18.07 | 88.15±39.62 | 82.06±19.05 | 87.70±31.43 |

Table 2  Statistical distribution of continuous features in the datasets
| Feature | Category | FDI, CHD (644) | FDI, no CHD (3 596) | FDII, CHD (1 229) | FDII, no CHD (8 665) |
| --- | --- | --- | --- | --- | --- |
| male | | 301 (12.44%) | 2 119 (87.56%) | 548 (9.50%) | 5 223 (90.50%) |
| | | 343 (18.85%) | 1 477 (81.15%) | 681 (16.52%) | 3 442 (83.48%) |
| education | Some high school | 323 (18.78%) | 1 397 (81.22%) | 568 (14.94%) | 3 234 (85.06%) |
| | High school or equivalent | 163 (12.00%) | 1 195 (88.00%) | 358 (10.94%) | 2 914 (89.06%) |
| | Some college or vocational school | 88 (12.77%) | 601 (87.23%) | 150 (9.09%) | 1 501 (90.91%) |
| | College | 70 (14.80%) | 403 (85.20%) | 153 (13.09%) | 1 016 (86.91%) |
| currentSmoke | | 333 (15.37%) | 1 834 (84.63%) | 677 (12.22%) | 4 863 (87.78%) |
| | | 311 (15.00%) | 1 762 (85.00%) | 552 (12.68%) | 3 802 (87.32%) |
| prevalentStroke | | 633 (15.02%) | 3 582 (84.98%) | 1 200 (12.23%) | 8 611 (87.77%) |
| | | 11 (44.00%) | 14 (56.00%) | 29 (34.94%) | 54 (65.06%) |
| prevalentHyp | | 319 (10.91%) | 2 604 (89.09%) | 434 (7.60%) | 5 279 (92.40%) |
| | | 325 (24.68%) | 992 (75.32%) | 795 (19.01%) | 3 386 (80.99%) |
| diabetes | | 604 (14.62%) | 3 527 (85.38%) | 1 112 (11.66%) | 8 426 (88.34%) |
| | | 40 (36.70%) | 69 (63.30%) | 117 (32.87%) | 239 (67.13%) |
| BPMeds | | 603 (14.65%) | 3 513 (85.35%) | 1 056 (11.47%) | 8 150 (88.53%) |
| | | 41 (33.06%) | 83 (66.94%) | 173 (25.15%) | 515 (74.85%) |

Table 3  Statistical distribution of discrete features in the datasets
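The percentages in Table 3 appear to be row-wise: within each category, the share of participants with and without CHD. The short check below reproduces three FDI rows from their counts; the helper `row_percentages` is a hypothetical name introduced for this illustration.

```python
def row_percentages(chd, no_chd):
    """Share of CHD and non-CHD participants within one category, in percent."""
    total = chd + no_chd
    return round(100 * chd / total, 2), round(100 * no_chd / total, 2)


# (CHD count, non-CHD count) for three FDI rows of Table 3.
fdi_rows = {
    "male, first row": (301, 2119),
    "male, second row": (343, 1477),
    "prevalentStroke, second row": (11, 14),
}

for name, (chd, no_chd) in fdi_rows.items():
    print(name, row_percentages(chd, no_chd))
```

Reading the table this way, 11 of the 25 FDI participants in the second prevalentStroke row developed CHD (44.00%), against 633 of 4 215 (15.02%) in the first row.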
Fig.6  Correlation between each feature and the 10-year coronary heart disease risk level
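| Hyperparameter | Description | Value / search range |
| --- | --- | --- |
| N_d | Width of the decision prediction layer | {8, 16, 32, 64} |
| N_a | Width of the attention embedding for each mask | {8, 16, 32, 64} |
| N_steps | Number of steps in the architecture | 3~10 |
| lr | Learning rate | 0.02 |
| batch_size | Number of samples per batch | 20 |

Table 4  Hyperparameter settings of the PCHD-TabNet framework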
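Table 4 lists candidate sets for three hyperparameters and fixed values for two. The page does not say how the candidates were searched; the sketch below simply assumes an exhaustive grid and enumerates it. The lowercase key names and the generator `candidate_configs` are illustrative choices.

```python
from itertools import product

# Candidate values from Table 4.
SEARCH_SPACE = {
    "n_d": [8, 16, 32, 64],
    "n_a": [8, 16, 32, 64],
    "n_steps": list(range(3, 11)),  # 3 to 10 inclusive
}

# Values that Table 4 reports as fixed.
FIXED = {"lr": 0.02, "batch_size": 20}


def candidate_configs():
    """Yield one dict per combination of the searched hyperparameters."""
    keys = list(SEARCH_SPACE)
    for values in product(*(SEARCH_SPACE[k] for k in keys)):
        yield {**dict(zip(keys, values)), **FIXED}


configs = list(candidate_configs())
print(len(configs))  # 4 * 4 * 8 combinations
```

Under the grid assumption this gives 128 configurations, each of which would be trained and scored on a validation split.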
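| Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC |
| --- | --- | --- | --- | --- | --- |
| DT | 74.59 | 18.71 | 16.16 | 22.22 | 0.53 |
| Bagging | 64.81 | 25.53 | 17.69 | 45.83 | 0.60 |
| XGBoost | 82.45 | 15.79 | 21.43 | 12.50 | 0.58 |
| RF | 83.00 | 13.08 | 20.00 | 9.72 | 0.63 |
| LightGBM | 83.27 | 13.27 | 20.90 | 9.72 | 0.58 |
| PCHD-TabNet | 60.05 | 31.40 | 20.28 | 69.44 | 0.67 |

Table 5  Experimental results on FDI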
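The F1 column in Table 5 is the harmonic mean of precision and recall, which is why PCHD-TabNet leads on F1 despite its lower accuracy: its recall is far higher than that of the tree-based models. The sketch below recomputes F1 from the reported precision and recall; the function name `f1_from` is an illustrative choice, and small differences in the second decimal come from the inputs already being rounded.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from Table 5.
TABLE5 = {
    "DT": (16.16, 22.22, 18.71),
    "Bagging": (17.69, 45.83, 25.53),
    "XGBoost": (21.43, 12.50, 15.79),
    "RF": (20.00, 9.72, 13.08),
    "LightGBM": (20.90, 9.72, 13.27),
    "PCHD-TabNet": (20.28, 69.44, 31.40),
}

for model, (precision, recall, reported) in TABLE5.items():
    print(f"{model}: recomputed {f1_from(precision, recall):.2f}, reported {reported:.2f}")
```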
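| Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC |
| --- | --- | --- | --- | --- | --- |
| DT | 77.91 | 20.46 | 16.06 | 28.17 | 0.56 |
| Bagging | 60.06 | 21.17 | 13.21 | 53.17 | 0.60 |
| XGBoost | 87.96 | 17.53 | 28.32 | 12.70 | 0.66 |
| RF | 87.84 | 22.05 | 31.16 | 17.06 | 0.69 |
| LightGBM | 87.19 | 19.19 | 26.39 | 15.08 | 0.65 |
| PCHD-TabNet | 70.91 | 30.16 | 19.90 | 62.30 | 0.72 |

Table 6  Experimental results on FDII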
Fig.7  Comparison of ROC curves and AUC
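For readers less familiar with the AUC values compared in Fig.7 and Tables 5 and 6: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. It is a plain pairwise implementation for illustration, not the evaluation code used in the paper, and the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted risks.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because AUC depends only on the ranking of scores, it is not affected by the decision threshold, unlike the accuracy, precision and recall columns.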
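| Study | Model | Dataset | Data balancing | Feature selection | Accuracy/% | F1/% | Precision/% | Recall/% | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rajliwall et al.[24] | SVM | FDI | - | All features | 90.2 | - | - | - | - |
| Pe et al.[9] | RF | FDI | Under-sampling | Automatic feature selection | 84.8 | - | - | - | - |
| Krishnani et al.[2] | RF | FDI | Over-sampling | Automatic feature selection | 96.8 | - | - | - | - |
| Elsayed et al.[7] | KNN | FDI | - | Feature importance | 66.7 | - | - | - | - |
| Dogan et al.[26] | Voting Ensemble Classifier | Framingham DNA methylation data | - | - | 78 | - | 75 | 78 | - |
| Kuruvilla et al.[25] | MLP | FDI | Over-sampling | - | 84.9 | 79.8 | 79 | 84.9 | 0.67 |
| This paper | PCHD-TabNet | FDI | SMOTE | All features | 60.05 | 31.4 | 20.28 | 69.44 | 0.67 |

Table 7  Results of related studies on the FDI dataset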
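Fig.8  Feature importance masks Mask[i] for 10-year coronary heart disease prediction (showing which features are selected at step i)
Fig.9  Global importance of each feature for 10-year risk prediction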
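Figs. 8 and 9 rest on TabNet's per-step masks. In the TabNet paper (reference [16]), the step masks are combined into one importance vector by weighting each step with the sum of its ReLU-activated decision outputs and then normalising. The sketch below follows that description for a single sample; whether PCHD-TabNet uses exactly this aggregation is an assumption, and the function name and toy numbers are illustrative.

```python
def aggregate_importance(masks, decision_outputs):
    """Combine per-step masks into one normalised importance vector.

    masks[i][j]         : weight given to feature j at step i
    decision_outputs[i] : decision vector produced at step i
    """
    n_features = len(masks[0])
    # Step weight: sum of the positive parts of that step's decision output.
    step_weights = [sum(max(0.0, d) for d in out) for out in decision_outputs]
    raw = [
        sum(w * step_mask[j] for w, step_mask in zip(step_weights, masks))
        for j in range(n_features)
    ]
    total = sum(raw)
    if total == 0:
        return [0.0] * n_features
    return [value / total for value in raw]


# Toy example: two steps, three features (values are made up).
masks = [[0.7, 0.3, 0.0],
         [0.0, 0.5, 0.5]]
decision_outputs = [[1.0, -2.0],
                    [0.5, 1.5]]
importance = aggregate_importance(masks, decision_outputs)
print(importance)
```

Averaging such per-sample vectors over a dataset is one way to obtain a global ranking like the one shown in Fig.9.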
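| Classification model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC |
| --- | --- | --- | --- | --- | --- |
| RF (data balanced before splitting) | 91.56 | 91.64 | 92.71 | 90.61 | 0.9158 |
| RF (data split before balancing) | 83.00 | 13.08 | 20.00 | 9.72 | 0.6312 |

Table 8  Comparison of balancing the data first versus splitting the data first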
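Table 8 shows how much the order of operations matters. When oversampling is applied before the train/test split, synthetic points derived from the same minority cases can land on both sides of the split, which tends to inflate test scores. The sketch below shows the split-first order on made-up data. It assumes a stratified 80/20 split and uses a simplified SMOTE-style interpolation between random minority pairs (real SMOTE interpolates towards nearest neighbours); all function names are hypothetical.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), keeping the class ratio in the test set."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * test_fraction))
        test_idx += idx[:cut]
        train_idx += idx[cut:]
    return train_idx, test_idx


def oversample_minority(rows, labels, rng):
    """Add interpolated minority samples until both classes are the same size."""
    counts = {cls: labels.count(cls) for cls in set(labels)}
    minority = min(counts, key=counts.get)
    needed = max(counts.values()) - counts[minority]
    pool = [r for r, y in zip(rows, labels) if y == minority]
    new_rows, new_labels = list(rows), list(labels)
    for _ in range(needed):
        a, b = rng.sample(pool, 2)
        t = rng.random()
        # New point on the segment between two real minority samples.
        new_rows.append([x + t * (z - x) for x, z in zip(a, b)])
        new_labels.append(minority)
    return new_rows, new_labels


rng = random.Random(0)
# Made-up data with roughly the FDI imbalance (about 15% positives).
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
labels = [1] * 15 + [0] * 85

# 1) Split first, so the test set contains only real, untouched samples.
train_idx, test_idx = stratified_split(labels, 0.2, rng)
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_test = [labels[i] for i in test_idx]

# 2) Balance the training portion only.
X_balanced, y_balanced = oversample_minority(X_train, y_train, rng)
```

With this ordering the test set keeps the original class ratio, so metrics such as those in the second row of Table 8 reflect performance on data of the kind the model would actually see.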
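| Model | Accuracy/% | F1/% | Precision/% | Recall/% | AUC |
| --- | --- | --- | --- | --- | --- |
| KNN | 85.56 | 83.12 | 82.05 | 84.21 | 0.91 |
| DT | 76.67 | 76.40 | 66.67 | 89.47 | 0.78 |
| Logistic | 85.56 | 83.54 | 80.49 | 86.84 | 0.93 |
| Bagging | 70.00 | 64.94 | 64.10 | 65.79 | 0.73 |
| XGBoost | 85.56 | 83.54 | 80.49 | 86.84 | 0.93 |
| RF | 84.44 | 82.05 | 80.00 | 84.21 | 0.94 |
| GBDT | 85.56 | 82.67 | 83.78 | 81.58 | 0.90 |
| LightGBM | 83.33 | 80.52 | 79.49 | 81.58 | 0.93 |
| PCHD-TabNet | 89.42 | 89.71 | 87.27 | 92.31 | 0.95 |

Table 9  Test results on the Cleveland dataset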
[1] Narain R, Saxena S, Goyal A K. Cardiovascular Risk Prediction: A Comparative Study of Framingham and Quantum Neural Network Based Approach[J]. Patient Preference and Adherence, 2016, 10: 1259-1270.
doi: 10.2147/PPA.S108203 pmid: 27486312
[2] Krishnani D, Kumari A, Dewangan A, et al. Prediction of Coronary Heart Disease Using Supervised Machine Learning Algorithms[C]// Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019: 367-372.
[3] The Joint Task Force for Guideline on the Assessment and Management of Cardiovascular Risk in China. Guideline on the Assessment and Management of Cardiovascular Risk in China[J]. Chinese Journal of Health Management, 2019, 13(1): 7-29. (in Chinese)
[4] Ma Jingyi, Liu Xiangtong, Lv Shiyun, et al. A Model of 7-Year Risk Assessment and Prediction for Coronary Heart Disease in Adults in Beijing[J]. Journal of Cardiovascular and Pulmonary Diseases, 2022, 41(1): 25-30. (in Chinese)
[5] Terrada O, Hamida S, Cherradi B, et al. Supervised Machine Learning Based Medical Diagnosis Support System for Prediction of Patients with Heart Disease[J]. Advances in Science, Technology and Engineering Systems Journal, 2020, 5(5): 269-277.
[6] Verma L, Srivastava S, Negi P C. A Hybrid Data Mining Model to Predict Coronary Artery Disease Cases Using Non-invasive Clinical Data[J]. Journal of Medical Systems, 2016, 40(7): 178.
doi: 10.1007/s10916-016-0536-z pmid: 27286983
[7] Elsayed H A G, Syed L. An Automatic Early Risk Classification of Hard Coronary Heart Diseases Using Framingham Scoring Model[C]// Proceedings of the 2nd International Conference on Internet of Things, Data and Cloud Computing. 2017: 1-8.
[8] Detrano R, Janosi A, Steinbrunn W, et al. International Application of a New Probability Algorithm for the Diagnosis of Coronary Artery Disease[J]. The American Journal of Cardiology, 1989, 64(5): 304-310.
doi: 10.1016/0002-9149(89)90524-9
[9] Pe R, Subasini C A, Katharine A V, et al. A Cardiovascular Disease Prediction Using Machine Learning Algorithms[J]. Annals of the Romanian Society for Cell Biology, 2021, 25(2): 904-912.
[10] Zhang Hui, Ma Haie, He Xiaodan. Predictive Value of Serum Ghrelin Level on the Risk of Coronary Heart Disease[J]. Laboratory Medicine and Clinic, 2022, 19(9): 1279-1282. (in Chinese)
[11] Song Wanmei, Liu Xifeng, Ma Xiaofeng. Application Value of Triglyceride-Glucose Index and Index of Triglyceride-Glucose-Body Mass Index to Prediction and Review of Coronary Heart Disease[J]. Chinese Journal of Evidence-Based Cardiovascular Medicine, 2021, 13(6): 680-683. (in Chinese)
[12] Kukar M, Kononenko I, Grošelj C, et al. Analysing and Improving the Diagnosis of Ischaemic Heart Disease with Machine Learning[J]. Artificial Intelligence in Medicine, 1999, 16(1): 25-50.
pmid: 10225345
[13] Pranatha M D A, Pramaita N, Sudarma M, et al. Filtering Outlier Data Using Box Whisker Plot Method for Fuzzy Time Series Rainfall Forecasting[C]// Proceedings of the 4th International Conference on Wireless and Telematics. 2018: 1-4.
[14] Chen T Q, Guestrin C. XGBoost: A Scalable Tree Boosting System[C]// Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 785-794.
[15] Ke G L, Meng Q, Finley T, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 3149-3157.
[16] Arik S Ö, Pfister T. TabNet: Attentive Interpretable Tabular Learning[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021, 35(8): 6679-6687.
[17] He K M, Zhang X Y, Ren S Q, et al. Deep Residual Learning for Image Recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[18] Conneau A, Schwenk H, Barrault L, et al. Very Deep Convolutional Networks for Text Classification[OL]. arXiv Preprint, arXiv:1606.01781.
[19] Amodei D, Ananthanarayanan S, Anubhai R, et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin[C]// Proceedings of the 33rd International Conference on Machine Learning. 2016: 173-182.
[20] Polzlbauer G, Lidy T, Rauber A. Decision Manifolds—A Supervised Learning Algorithm Based on Self-organization[J]. IEEE Transactions on Neural Networks, 2008, 19(9): 1518-1530.
doi: 10.1109/TNN.2008.2000449 pmid: 18779085
[21] Dietterich T G. Machine Learning Research: Four Current Directions[J]. Artificial Intelligence Magazine, 1997, 18(4): 97-136.
[22] Friedl M A, Brodley C E. Decision Tree Classification of Land Cover from Remotely Sensed Data[J]. Remote Sensing of Environment, 1997, 61(3): 399-409.
doi: 10.1016/S0034-4257(97)00049-7
[23] Song S, Warren J, Riddle P. Developing High Risk Clusters for Chronic Disease Events with Classification Association Rule Mining[C]// Proceedings of the 7th Australasian Workshop on Health Informatics and Knowledge Management-Volume 153. 2014: 69-78.
[24] Rajliwall N S, Davey R, Chetty G. Machine Learning Based Models for Cardiovascular Risk Prediction[C]// Proceedings of 2018 International Conference on Machine Learning and Data Engineering. 2018: 142-148.
[25] Kuruvilla A M, Balaji N V. Heart Disease Prediction System Using Correlation Based Feature Selection with Multilayer Perceptron Approach[J]. IOP Conference Series: Materials Science and Engineering, 2021, 1085(1): 012028.
doi: 10.1088/1757-899X/1085/1/012028
[26] Dogan M V, Grumbach I M, Michaelson J J, et al. Integrated Genetic and Epigenetic Prediction of Coronary Heart Disease in the Framingham Heart Study[J]. PLoS One, 2018, 13(1): e0190549.
doi: 10.1371/journal.pone.0190549