[Objective] This study builds a predictive model of depression severity from Internet user-generated content and examines the model's interpretability, with the aim of making automated depression detection more reliable and practical.
[Methods] First, we built a corpus of depression-related medical consultations collected from the Good Doctor Online platform. Second, we extracted patients' psychological features using C-LIWC, a psychological lexicon. Third, we predicted patients' conditions with the Gradient Boosting Tree algorithm. Finally, we applied the explainable machine learning method SHAP to interpret the resulting model. Using SHAP's visualizations, we analyzed how patients' age, gender, cognition, emotions, perceptions, social and family contexts, and personal gains or losses relate to the occurrence of depression.
[Results] Patients' psychological states reflect their condition. Psychological features extracted from consultation records detected severe depression with an accuracy of 86%. SHAP shows that patients' psychological features affect depression predictions in multiple ways.
[Limitations] Because of the corpus, each severity prediction was based on a single consultation record. In addition, the model's features came only from psychological dictionaries; future work could include more factors related to depression risk.
[Conclusions] The factors that influence the occurrence and development of depression are complex, and individual differences mean that the same feature can affect predictions differently from patient to patient. An automated diagnostic model for depression should therefore be both accurate and understandable in its predictions.
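To make the attribution idea behind SHAP concrete, the sketch below computes exact Shapley values for a toy scoring function by enumerating every feature coalition. It is an illustration under stated assumptions, not the study's pipeline: the feature names, the scoring function, and the all-zero baseline are hypothetical, and in practice SHAP explains tree ensembles with a far more efficient tree-specific algorithm rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Cost grows as 2^n in the number of features, so this is only
    practical for a handful of features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance's values;
        # all other features stay at their baseline values.
        x = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(x)

    phi = {}
    for k in names:
        others = [m for m in names if m != k]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature k when added to coalition s.
                total += weight * (value(s | {k}) - value(s))
        phi[k] = total
    return phi


def toy_severity_score(x):
    # Hypothetical stand-in for a trained model: two additive terms plus
    # an interaction between negative emotion and social context.
    return (0.5 * x["negative_emotion"]
            + 0.25 * x["cognition"]
            + 0.5 * x["negative_emotion"] * x["social"])


# Hypothetical feature values for one consultation record and a baseline.
instance = {"negative_emotion": 1.0, "cognition": 1.0, "social": 1.0}
baseline = {"negative_emotion": 0.0, "cognition": 0.0, "social": 0.0}

phi = shapley_values(toy_severity_score, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy case the interaction term is split evenly between the two features involved, so negative_emotion receives about 0.75 while cognition and social receive about 0.25 each. The contributions sum to the gap between the prediction for the record (1.25) and the prediction for the baseline (0), which is the additivity property that makes per-patient explanations possible.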
Nie Hui, Wu Xiaoyan. Detecting Depression Factors with Gradient Boosting Tree and Explainable Machine Learning Model SHAP[J]. Data Analysis and Knowledge Discovery, 2024, 8(3): 41-52.