Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (3): 41-52    DOI: 10.11925/infotech.2096-3467.2023.0052
Detecting Depression Factors with Gradient Boosting Tree and Explainable Machine Learning Model SHAP
Nie Hui, Wu Xiaoyan
School of Information Management, Sun Yat-Sen University, Guangzhou 510275, China
Abstract  

[Objective] This study builds a predictive model of depression severity and examines its interpretability, aiming to make automated depression detection from Internet user-generated content more reliable and practical. [Methods] We first built a corpus of depression-related medical consultations collected from the Good Doctor Online platform. We then extracted patients’ psychological features with C-LIWC, a psycholinguistic lexicon, and predicted depression severity with a gradient boosting tree algorithm (LightGBM). Finally, we applied the explainable machine learning method SHAP to interpret the model, using its visualizations to analyze how patients’ age, gender, cognition, emotions, perceptions, social and family contexts, and personal gains or losses relate to depression severity. [Results] Patients’ psychological state reflects their condition. Psychological features extracted from consultation records detected severe depression effectively, with a recall of 86% for the severe class. SHAP shows that patients’ psychological features affect the predicted severity in multiple ways. [Limitations] Constrained by the corpus, each severity prediction was based on a single consultation record. In addition, the model’s features came only from a psychological lexicon; future work could include further risk factors for depression. [Conclusions] The factors that influence the onset and development of depression are complex, and individual differences mean that the same feature can affect predictions differently across patients. An automated diagnostic model for depression should therefore aim not only for accuracy but also for predictions that can be understood.

Keywords: Depression Prediction; Online User-Generated Content; Interpretable Machine Learning; Light Gradient Boosting Machine
Received: 27 January 2023      Published: 28 April 2023
ZTFLH: TP391; G350
Fund: Social Science Fund of Guangzhou (10000-42220402)
Corresponding Author: Nie Hui, ORCID: 0000-0001-8567-3084, E-mail: issnh@mail.sysu.edu.cn

Cite this article:

Nie Hui, Wu Xiaoyan. Detecting Depression Factors with Gradient Boosting Tree and Explainable Machine Learning Model SHAP. Data Analysis and Knowledge Discovery, 2024, 8(3): 41-52.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0052     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I3/41

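Category | Subcategory | Example words | Description
Social | Family words | child (孩子), parents (父母), family members (家人), husband (丈夫) | Covers terms of address; reflects, to some extent, interpersonal interaction and social and family relationships
Social | Friend words | friend (朋友), colleague (同事), boyfriend (男友), neighbor (邻居) |
Social | Human words | teacher (老师), children (儿童), crowd (人群), newborn (新生儿) |
Affective | Positive emotion words | happy (开心), hope (希望), comfortable (舒服), excited (兴奋) | Describes patients' emotions from positive and negative angles; negative emotions correspond to the anxiety, sadness, despair, and similar states patients often show
Affective | Anxiety words | anxiety (焦虑), insomnia (失眠), stress (压力), trembling (发抖) |
Affective | Anger words | irritable (暴躁), angry (生气), hate (讨厌), hurt (伤害) |
Affective | Sadness words | low mood (低落), pain (痛苦), self-blame (自责), despair (绝望) |
Cognitive | Insight words | feel (感觉), like (喜欢), learn (学习), discover (发现) | Characterizes patients' understanding of, acceptance of, and expectations about their illness and its diagnosis and treatment
Cognitive | Causation words | cause (导致), affect (影响), aggravate (加重), effect (效果) |
Cognitive | Discrepancy words | hope (希望), want (想要), regret (后悔), originally (本来) |
Cognitive | Tentative words | sometimes (有时候), tend (倾向), often (时常), mostly (大部分) |
Cognitive | Certainty words | really (真的), information (信息), confident (自信), prove (证明) |
Perceptual | Seeing words | eyes (眼睛), photo (照片), darkness (黑暗), sunshine (阳光) | Mainly patients' direct statements about their own sensory perceptions
Perceptual | Hearing words | speak (说话), sound (声音), tell (告诉), quiet (安静) |
Perceptual | Feeling words | feel (感觉), stimulate (刺激), touch (接触), soreness (酸痛) |
Physiological | Body words | appetite (食欲), heart (心脏), breathing (呼吸), nerves (神经) | Describes patients' symptoms and physiological state in terms of body organs and physiological behavior; health words are a set of terms related to diagnosis and treatment
Physiological | Health words | hospital (医院), symptom (症状), taking medication (服药), hospitalization (住院) |
Physiological | Sexual words | pregnancy (怀孕), libido (性欲), AIDS (艾滋病), homosexuality (同性恋) |
Physiological | Ingestion words | binge eating (暴饮暴食), digestion (消化), smoking (抽烟), anorexia (厌食) |
Personal gains or losses | Work words | work (工作), study (学习), university (大学), exam (考试) | Characterizes patients' study, work, daily life, and related circumstances; achievement words mainly describe the effect of diagnosis and treatment
Personal gains or losses | Achievement words | control (控制), effect (效果), sustain (持续), improve (改善) |
Personal gains or losses | Leisure words | rest (休息), summer vacation (暑假), relax (放松), weekend (周末) |
Personal gains or losses | Home words | sleep (睡觉), homework (作业), housework (家务), pet (宠物) |
Personal gains or losses | Money words | shopping (购物), fees (费用), stocks (股票), loans (贷款) |
Personal gains or losses | Religion words | soul (灵魂), meditation (冥想), devil (魔鬼), prayer (祈祷) |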
Six Sets of Psycholinguistic Words Associated with Depression
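The lexicon above can be turned into numeric features by measuring how much of a consultation text falls into each category. The following is a minimal sketch under stated assumptions, not the authors' implementation: `LEXICON` is a hypothetical mini-dictionary holding only a few of the example words listed above, the text is assumed to be word-segmented already, and normalizing by token count is an assumption about how the features are scaled.

```python
from collections import Counter

# Hypothetical mini-lexicon built from a few example words in the table above.
# The real C-LIWC dictionary is far larger; category names are illustrative.
LEXICON = {
    "Socialize": {"孩子", "父母", "朋友", "同事"},
    "Emotion": {"开心", "焦虑", "失眠", "绝望"},
    "Cognition": {"感觉", "导致", "希望", "有时候"},
    "Perception": {"感觉", "眼睛", "声音", "酸痛"},
    "Physiology": {"食欲", "医院", "服药"},
    "GainsOrLosses": {"工作", "效果", "休息", "费用"},
}


def category_ratios(tokens):
    """Return, for each category, the share of tokens that belong to it."""
    if not tokens:
        return {name: 0.0 for name in LEXICON}
    hits = Counter()
    for token in tokens:
        for name, words in LEXICON.items():
            if token in words:
                # A word listed under several categories counts toward each.
                hits[name] += 1
    return {name: hits[name] / len(tokens) for name in LEXICON}


# Pre-tokenized consultation text (segmentation is assumed to happen upstream).
tokens = ["最近", "失眠", "焦虑", "工作", "压力", "感觉", "绝望", "父母", "担心", "服药"]
features = category_ratios(tokens)
```

Note that 感觉 appears under both the insight and feeling subcategories in the table, so in this sketch it contributes to both Cognition and Perception.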
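Variable type | Feature | Data type | Description
Predicted variable | Depression | Categorical | Depression severity (0: mild, 1: moderate, 2: severe)
Predictor variable | Gender | Binary | Gender (0: male, 1: female)
Predictor variable | Age | Numeric | Age
Predictor variable | Socialize | Numeric | Social and family
Predictor variable | Emotion | Numeric | Emotion
Predictor variable | Cognition | Numeric | Cognition
Predictor variable | Perception | Numeric | Perception
Predictor variable | Physiology | Numeric | Physiology
Predictor variable | Gains or Losses | Numeric | Personal gains or losses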
The Features of Depression Severity Prediction Model
Research Framework
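Psychological feature | Variable | Mean | Standard deviation | Variance
Social and family | Socialize | 0.090 | 0.105 | 0.011
Emotion | Emotion | 0.098 | 0.077 | 0.006
Cognition | Cognition | 0.321 | 0.151 | 0.023
Perception | Perception | 0.061 | 0.095 | 0.009
Physiology | Physiology | 0.161 | 0.118 | 0.014
Personal gains or losses | Gains or Losses | 0.145 | 0.122 | 0.015
Descriptive Statistics for Psycholinguistic Variables (N = 2,950)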
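Because the table reports both the standard deviation and the variance, the two columns can be cross-checked: the variance should equal the squared standard deviation up to rounding. The sketch below does that check on the reported values only; the tolerance `tol` is an assumption chosen to absorb three-decimal rounding.

```python
# (mean, standard deviation, variance) exactly as reported in the table above.
reported = {
    "Socialize": (0.090, 0.105, 0.011),
    "Emotion": (0.098, 0.077, 0.006),
    "Cognition": (0.321, 0.151, 0.023),
    "Perception": (0.061, 0.095, 0.009),
    "Physiology": (0.161, 0.118, 0.014),
    "Gains or Losses": (0.145, 0.122, 0.015),
}


def consistent(sd, variance, tol=0.001):
    """True if variance matches sd squared within a rounding tolerance."""
    return abs(sd ** 2 - variance) <= tol


# Map each variable name to whether its two spread columns agree.
checks = {name: consistent(sd, var) for name, (_, sd, var) in reported.items()}
```

All six rows pass this check, so the variance column carries no information beyond the standard deviation.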
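Parameter | Meaning | Search range | Notes
max_depth | Maximum tree depth | {3, 4, 5} | Maximum distance from the root node to a leaf node; it should not be too large, and a smaller max_depth reduces training time; determined experimentally
num_leaves | Number of leaves | {5, 6, 7, 12, 13, 14, 15, 28, 29, 30, 31} | The number of leaves grows as a power of 2 with the maximum tree depth; choosing num_leaves < 2^max_depth generally helps avoid overfitting
learning_rate | Learning rate | {0.1, 0.05, 0.01} | Affects training accuracy; determined experimentally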
Tuning LightGBM Algorithm’s Parameters
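The search ranges and the constraint num_leaves < 2^max_depth from the table above define a finite set of candidate settings. The sketch below enumerates them in plain Python. It assumes the three ranges are crossed in full and then filtered by the constraint; the source does not say whether the authors searched the grid this way, and `candidate_grid` is a hypothetical helper name.

```python
from itertools import product

# Search ranges copied from the tuning table above.
MAX_DEPTH = [3, 4, 5]
NUM_LEAVES = [5, 6, 7, 12, 13, 14, 15, 28, 29, 30, 31]
LEARNING_RATE = [0.1, 0.05, 0.01]


def candidate_grid():
    """Enumerate settings that satisfy num_leaves < 2 ** max_depth."""
    grid = []
    for depth, leaves, rate in product(MAX_DEPTH, NUM_LEAVES, LEARNING_RATE):
        if leaves < 2 ** depth:  # constraint noted in the table
            grid.append(
                {"max_depth": depth, "num_leaves": leaves, "learning_rate": rate}
            )
    return grid


grid = candidate_grid()
```

The listed num_leaves values sit just below 8, 16, and 32, that is, just below 2^3, 2^4, and 2^5, which matches the three max_depth candidates. Each dictionary uses LightGBM's own parameter names, so it could be passed as keyword arguments to a LightGBM estimator during cross-validation.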
Depression level | Precision/% | Recall/% | F1/%
Mild | 79 | 4 | 7
Moderate | 60 | 49 | 54
Severe | 53 | 86 | 66
Performance of the LightGBM Model
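The F1 column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this from the rounded percentages in the table; because the inputs are already rounded, the recomputed values can differ from the published ones by about one point (the mild class comes out near 7.6, against the reported 7).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per class, as reported in the table above.
reported = {"Mild": (79, 4), "Moderate": (60, 49), "Severe": (53, 86)}

scores = {level: f1(p, r) for level, (p, r) in reported.items()}
```

The mild class shows why F1 is reported alongside precision: a precision of 79% with a recall of 4% means the model rarely predicts "mild", and the harmonic mean is pulled toward the lower of the two values.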
The Confusion Matrix for Predicting Depression Level
Explaining the Depression Prediction Model Using SHAP Values
Interaction Effect Analysis Based on SHAP Values
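SHAP attributes a prediction to individual features using Shapley values from cooperative game theory. The sketch below computes exact Shapley values by brute force for a toy scoring function. Everything in it is hypothetical: `toy_value` and its weights are invented for illustration and are not taken from the paper's model, and in practice SHAP's tree explainer computes these values efficiently for tree ensembles rather than enumerating subsets.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (small n only)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that p could join.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical score: two main effects plus one interaction term."""
    score = 0.0
    if "Emotion" in coalition:
        score += 0.30
    if "Age" in coalition:
        score += 0.10
    if "Emotion" in coalition and "Gender" in coalition:
        score += 0.20  # interaction, split evenly between the two features
    return score


features = ["Emotion", "Age", "Gender"]
phi = shapley_values(features, toy_value)
```

In this toy case the interaction term is shared equally between Emotion and Gender, and the attributions sum to the score of the full feature set. That additivity is what lets a single prediction be decomposed feature by feature.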