Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (3): 41-52     https://doi.org/10.11925/infotech.2096-3467.2023.0052
Research Paper
Detecting Depression Factors with Gradient Boosting Tree and Explainable Machine Learning Model SHAP*
Nie Hui, Wu Xiaoyan
School of Information Management, Sun Yat-Sen University, Guangzhou 510275, China

Abstract

[Objective] This study constructs a model for predicting depression severity and examines its interpretability. By analyzing Internet user-generated content, it aims to advance research on depression risk prediction and to improve the reliability and practicality of automated depression detection models. [Methods] We built a corpus from depression-related medical consultation records collected on the Good Doctor Online platform, extracted patients’ psychological features using C-LIWC, a psycholinguistic lexicon, and predicted patients’ conditions with the Gradient Boosting Tree algorithm. We then applied the explainable machine learning method SHAP to interpret the model, using SHAP’s visualizations to analyze the complex relationships between the occurrence of depression and patients’ age, gender, cognition, emotions, perceptions, social and family contexts, and personal gains or losses. [Results] Patients’ psychological states reflected their condition. Psychological features extracted from consultation records detected severe depression effectively, with an accuracy of 86%. SHAP explained the model’s predictions and revealed the multiple effects that patients’ psychological features have on depression. [Limitations] Limited by the corpus, depression severity was predicted from single consultation records only. In addition, the model features were derived from a psychological lexicon; further factors related to depression risk could be included in future modeling. [Conclusions] The factors that influence the onset and development of depression are complex, and individual differences cause each feature to affect disease prediction differently. Building an automated diagnostic model for depression requires attention not only to the model’s accuracy but also to a better understanding of its predictions.

Keywords: Depression Prediction; Online User-Generated Content; Interpretable Machine Learning; Light Gradient Boosting Machine
Received: 2023-01-27      Published: 2023-04-28
CLC Number (ZTFLH): TP391; G350
Funding: * 2022 Guangzhou Social Science Fund Project (10000-42220402)
Corresponding author: Nie Hui, ORCID: 0000-0001-8567-3084, E-mail: issnh@mail.sysu.edu.cn
Cite this article:
Nie Hui, Wu Xiaoyan. Detecting Depression Factors with Gradient Boosting Tree and Explainable Machine Learning Model SHAP. Data Analysis and Knowledge Discovery, 2024, 8(3): 41-52.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0052      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I3/41
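| Category | Subcategory | Examples | Description |
| --- | --- | --- | --- |
| Social | Family words | 孩子 (child), 父母 (parents), 家人 (family members), 丈夫 (husband) | Covers all kinds of terms of address; to some extent reflects interpersonal interaction and social and family relationships |
| | Friend words | 朋友 (friend), 同事 (colleague), 男友 (boyfriend), 邻居 (neighbor) | |
| | Human words | 老师 (teacher), 儿童 (children), 人群 (crowd), 新生儿 (newborn) | |
| Emotion | Positive emotion words | 开心 (happy), 希望 (hope), 舒服 (comfortable), 兴奋 (excited) | Describes patients' emotions from both positive and negative angles; negative emotions correspond to the anxiety, sadness, despair, and similar states that patients often show |
| | Anxiety words | 焦虑 (anxiety), 失眠 (insomnia), 压力 (stress), 发抖 (trembling) | |
| | Anger words | 暴躁 (irritable), 生气 (angry), 讨厌 (dislike), 伤害 (hurt) | |
| | Sadness words | 低落 (low mood), 痛苦 (pain), 自责 (self-blame), 绝望 (despair) | |
| Cognition | Insight words | 感觉 (feel), 喜欢 (like), 学习 (learn), 发现 (discover) | Characterizes patients' cognitive state, acceptance, and expectations regarding their own illness, diagnosis, and treatment |
| | Causation words | 导致 (lead to), 影响 (affect), 加重 (worsen), 效果 (effect) | |
| | Discrepancy words | 希望 (hope), 想要 (want), 后悔 (regret), 本来 (originally) | |
| | Tentative words | 有时候 (sometimes), 倾向 (tend to), 时常 (often), 大部分 (mostly) | |
| | Certainty words | 真的 (really), 信息 (information), 自信 (confidence), 证明 (prove) | |
| Perception | Visual words | 眼睛 (eyes), 照片 (photo), 黑暗 (darkness), 阳光 (sunshine) | Mainly patients' direct descriptions of their own sensory perceptions |
| | Auditory words | 说话 (speak), 声音 (sound), 告诉 (tell), 安静 (quiet) | |
| | Feeling words | 感觉 (feel), 刺激 (stimulation), 接触 (contact), 酸痛 (soreness) | |
| Physiology | Body words | 食欲 (appetite), 心脏 (heart), 呼吸 (breathing), 神经 (nerves) | Describes patients' symptoms and physiological state in terms of body organs and physiological behaviors; health words are a set of terms related to diagnosis and treatment |
| | Health words | 医院 (hospital), 症状 (symptom), 服药 (taking medication), 住院 (hospitalization) | |
| | Sexual words | 怀孕 (pregnancy), 性欲 (libido), 艾滋病 (AIDS), 同性恋 (homosexuality) | |
| | Ingestion words | 暴饮暴食 (binge eating), 消化 (digestion), 抽烟 (smoking), 厌食 (anorexia) | |
| Personal gains and losses | Work words | 工作 (work), 学习 (study), 大学 (university), 考试 (exam) | Characterizes patients' study, work, daily life, and related states; achievement words mainly describe patients' treatment outcomes |
| | Achievement words | 控制 (control), 效果 (effect), 持续 (sustain), 改善 (improve) | |
| | Leisure words | 休息 (rest), 暑假 (summer vacation), 放松 (relax), 周末 (weekend) | |
| | Home words | 睡觉 (sleep), 作业 (homework), 家务 (housework), 宠物 (pet) | |
| | Money words | 购物 (shopping), 费用 (fees), 股票 (stocks), 贷款 (loan) | |
| | Religion words | 灵魂 (soul), 冥想 (meditation), 魔鬼 (devil), 祈祷 (prayer) | |

Table 1  Six categories of psycholinguistic vocabulary related to depression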
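| Variable type | Feature variable | Data type | Description |
| --- | --- | --- | --- |
| Predicted variable | Depression | Categorical | Depression severity (0: mild, 1: moderate, 2: severe) |
| Predictor variable | Gender | Binary | Gender (0: male, 1: female) |
| | Age | Numeric | Age |
| | Socialize | | Social and family |
| | Emotion | | Emotion |
| | Cognition | | Cognition |
| | Perception | | Perception |
| | Physiology | | Physiology |
| | Gains or Losses | | Personal gains and losses |

Table 2  Feature variables of the depression severity prediction model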
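
As a rough illustration of how the lexicon-based predictors in Table 2 might be derived, the sketch below counts, for each category, the share of tokens in a consultation text that appear in that category's word list. The mini-lexicon (a few example words from Table 1), the assumption that the text is already word-segmented, and the choice of a simple token ratio are all illustrative assumptions; the source does not state the exact scoring rule used with C-LIWC.

```python
from collections import Counter

# Hypothetical mini-lexicon built from a few example words in Table 1.
# The real C-LIWC lexicon is much larger; this mapping is for illustration only.
LEXICON = {
    "Socialize": {"孩子", "父母", "朋友", "同事"},
    "Emotion": {"开心", "焦虑", "生气", "绝望"},
    "Cognition": {"感觉", "导致", "后悔", "真的"},
    "Perception": {"眼睛", "声音", "酸痛"},
    "Physiology": {"食欲", "医院", "服药", "厌食"},
    "Gains or Losses": {"工作", "考试", "休息", "费用"},
}


def category_ratios(tokens):
    """Return, per category, the share of tokens found in that category's word list."""
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        # A word may belong to more than one category, so check all of them.
        for category, words in LEXICON.items():
            if token in words:
                counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in LEXICON}


# Assumed input: a consultation text that has already been word-segmented.
tokens = ["最近", "工作", "压力", "大", "经常", "焦虑", "食欲", "差", "父母", "担心"]
features = category_ratios(tokens)
print(features)
```

Each resulting value lies between 0 and 1, which is at least compatible with the scale of the statistics reported in Table 3, although that match is not confirmed by the source.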
Fig. 1  Research framework
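| Psychological feature | Variable name | Mean | Standard deviation | Variance |
| --- | --- | --- | --- | --- |
| Social and family | Socialize | 0.090 | 0.105 | 0.011 |
| Emotion | Emotion | 0.098 | 0.077 | 0.006 |
| Cognition | Cognition | 0.321 | 0.151 | 0.023 |
| Perception | Perception | 0.061 | 0.095 | 0.009 |
| Physiology | Physiology | 0.161 | 0.118 | 0.014 |
| Personal gains and losses | Gains or Losses | 0.145 | 0.122 | 0.015 |

Table 3  Descriptive statistics of the psychological features of depression patients (sample size N = 2,950)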
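| Parameter | Meaning | Search range | Notes |
| --- | --- | --- | --- |
| max_depth | Maximum tree depth | {3, 4, 5} | The maximum distance from the root node to a leaf node. It should not be too large; a smaller max_depth reduces training time. Determined experimentally. |
| num_leaves | Number of leaves | {5, 6, 7, 12, 13, 14, 15, 28, 29, 30, 31} | The number of leaves grows as a power of 2 with the maximum tree depth; choosing num_leaves < 2^max_depth generally helps avoid overfitting. |
| learning_rate | Learning rate | {0.1, 0.05, 0.01} | Affects training accuracy. Determined experimentally. |

Table 4  Parameter tuning of the LightGBM algorithm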
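
The search space in Table 4 can be enumerated in a few lines. The sketch below builds every combination of the three listed ranges and keeps only those that satisfy the rule of thumb num_leaves < 2^max_depth. Applying the rule as a hard filter is an assumption made here for illustration; the source does not say how the authors combined the ranges. The function name `candidate_grid` is hypothetical.

```python
from itertools import product

# Search ranges copied from Table 4.
MAX_DEPTHS = [3, 4, 5]
NUM_LEAVES = [5, 6, 7, 12, 13, 14, 15, 28, 29, 30, 31]
LEARNING_RATES = [0.1, 0.05, 0.01]


def candidate_grid():
    """Enumerate parameter combinations that satisfy num_leaves < 2 ** max_depth."""
    grid = []
    for depth, leaves, rate in product(MAX_DEPTHS, NUM_LEAVES, LEARNING_RATES):
        # Assumed hard filter based on the rule of thumb quoted in Table 4.
        if leaves < 2 ** depth:
            grid.append(
                {"max_depth": depth, "num_leaves": leaves, "learning_rate": rate}
            )
    return grid


grid = candidate_grid()
print(len(grid))  # 63 combinations remain out of 99
```

The listed leaf counts line up with the three depths: 5 to 7 fall below 2^3, 12 to 15 below 2^4, and 28 to 31 below 2^5.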
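| Depression severity | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- |
| Mild | 79 | 4 | 7 |
| Moderate | 60 | 49 | 54 |
| Severe | 53 | 86 | 66 |

Table 5  Performance of the LightGBM-based prediction model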
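
For readers who want to see how per-class figures like those in Table 5 are obtained, the sketch below computes precision, recall, and F1 for each class from a confusion matrix. The counts used are made up for illustration and are not the values behind Fig. 2; the helper name `per_class_metrics` is likewise hypothetical.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, and F1 per class from a square confusion matrix.

    confusion[i][j] is the number of samples with true class i and predicted class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_k = sum(row[k] for row in confusion)  # column sum
        actual_k = sum(confusion[k])  # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Made-up counts for illustration only; NOT the data behind Fig. 2.
labels = ["Mild", "Moderate", "Severe"]
confusion = [
    [2, 3, 5],
    [1, 6, 3],
    [0, 2, 8],
]
metrics = per_class_metrics(confusion, labels)
for label in labels:
    print(label, metrics[label])
```

A class can combine high precision with very low recall, as the mild class does in Table 5, when the model rarely predicts it but is usually right when it does.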
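Fig. 2  Confusion matrix of depression severity prediction
Fig. 3  Explaining the depression prediction model with SHAP values
Fig. 4  Interaction effect analysis based on SHAP values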
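
SHAP values are built on Shapley values from cooperative game theory. To make the idea behind Fig. 3 and Fig. 4 concrete, the sketch below computes exact Shapley values by brute force for a tiny hypothetical scoring function with one interaction term. It is not the tree-specific algorithm that SHAP libraries use for gradient boosting models, and the toy model, its coefficients, and the all-zero baseline are assumptions chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2 ** n_features, so this suits tiny examples only.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical scoring function with an interaction between the first two features.
def toy_model(features):
    emotion, cognition, age = features
    return 2.0 * emotion + 1.0 * cognition + 3.0 * emotion * cognition + 0.5 * age


x = [1.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

In this toy case the interaction term is split evenly between the two features involved, and the attributions add up to the difference between the prediction at `x` and the prediction at the baseline.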