Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (9): 125-135    DOI: 10.11925/infotech.2096-3467.2022.0830
Detecting Crowdfunding Frauds Based on Textual and Imbalanced Data
Xu Chen, Zhang Wei
Business School, Central South University, Changsha 410083, China
Abstract  

[Objective] This paper develops a new model to detect fraud in crowdfunding activities. [Methods] We extracted textual cues from project descriptions along three dimensions: cognitive load, narrative perspective, and emotional output. We then built ensemble models and optimized them with resampling and threshold moving methods. [Results] The AUC values of the optimized models reached 0.8. The threshold moving method further improved the models’ performance: F1 scores improved by 0.279 on average, with a maximum improvement of 195%. [Limitations] The proposed models use only textual features from the project description and do not consider features from other dimensions. [Conclusions] Ensemble models based on resampling and threshold moving methods can effectively identify fraudulent crowdfunding projects.
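The threshold moving step named in the abstract can be illustrated with a minimal Python sketch. It assumes the classifier already outputs a fraud probability per project and that the cut-off is chosen to maximize F1 on held-out validation data; the page does not state which selection criterion the authors used, and the labels and scores below are hypothetical.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 of the positive (fraud) class when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Try every distinct score as a cut-off and return (threshold, F1) with the highest F1."""
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))
    return best, f1_at_threshold(y_true, scores, best)


# Hypothetical validation set: 2 fraudulent projects (label 1) among 10.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
p_val = [0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.75, 0.90]

default_f1 = f1_at_threshold(y_val, p_val, 0.5)   # default cut-off of 0.5
threshold, tuned_f1 = best_threshold(y_val, p_val)
print(f"default F1={default_f1:.3f}, tuned threshold={threshold}, tuned F1={tuned_f1:.3f}")
```

On imbalanced data the default cut-off of 0.5 tends to admit many false positives after resampling, so moving the cut-off trades some recall for precision. The threshold should be selected on validation data and then applied unchanged to the test set.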

Key words: Donation-Based Crowdfunding; Deceptive Discourse; Machine Learning; Imbalanced Classification
Received: 07 August 2022      Published: 24 October 2023
ZTFLH: TP391; G350
Fund: The National Natural Science Foundation of China (71974207)
Corresponding Author: Zhang Wei, ORCID: 0000-0002-6589-8312, E-mail: 1147851113@qq.com

Cite this article:

Xu Chen, Zhang Wei. Detecting Crowdfunding Frauds Based on Textual and Imbalanced Data. Data Analysis and Knowledge Discovery, 2023, 7(9): 125-135.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0830     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I9/125

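| Cue dimension | Variable | Description | Theory or literature basis | LIWC word categories used |
|---|---|---|---|---|
| Narrative perspective | Self-reference | Proportion of first-person pronouns (e.g., I, me) in the total word count | IDT | 1st pers singular, 1st pers plural |
| Narrative perspective | Other-reference | Proportion of non-first-person pronouns (e.g., you, he) | IDT | 2nd person, 3rd pers singular, 3rd pers plural |
| Narrative perspective | Past orientation | Proportion of past-tense-related words (e.g., ago, did) | Chua et al. [38] | focuspast |
| Narrative perspective | Present orientation | Proportion of present-tense-related words (e.g., today, now) | Chua et al. [38] | focuspresent |
| Narrative perspective | Future orientation | Proportion of future-tense-related words (e.g., will, soon) | Chua et al. [38] | focusfuture |
| Cognitive load | Text length | Total number of words in the project description | IDT | Word count |
| Cognitive load | Lexical complexity | Proportion of words longer than six letters | IDT | Words > 6 letters |
| Cognitive load | Sentence complexity | Average number of words per sentence | Braun et al. [42] | Words/sentence |
| Cognitive load | Temporal details | Proportion of time-related words (e.g., end, until) | CBCA, IDT | Time |
| Cognitive load | Spatial details | Proportion of space-related words (e.g., down, in) | IDT | Space |
| Cognitive load | Social details | Proportion of social-relation words (e.g., mate, neighbor) | IDT | Social processes |
| Cognitive load | Perceptual details | Proportion of perceptual-process words (e.g., look, heard) | IDT | Perceptual processes |
| Cognitive load | Certainty expressions | Proportion of certainty-related words (e.g., always, never) | CBCA, RM | Certainty |
| Cognitive load | Uncertainty expressions | Proportion of uncertainty words (e.g., maybe, perhaps) | CBCA, RM | Tentative |
| Cognitive load | Negation expressions | Proportion of negation words (e.g., no, never) | Newman et al. [18] | Negations |
| Cognitive load | Causal expressions | Proportion of causation-related words (e.g., because, effect) | RM | Causation |
| Emotional output | Negative emotion | Proportion of negative emotion words (e.g., hurt, nasty) | RM | Negative emotion |
| Emotional output | Positive emotion | Proportion of positive emotion words (e.g., love, nice) | RM | Positive emotion |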
Textual Cue List
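Most cues in the list are word-category proportions, plus a few length measures. The following Python sketch shows how such values could be computed from a project description. It is not LIWC: the `CUE_WORDS` lists are hypothetical stand-ins that contain only the example words quoted in the table, and the tokenization and sentence-splitting rules are simplifying assumptions.

```python
import re
from typing import Dict

# Hypothetical mini word lists taken from the table's examples only;
# the actual LIWC dictionaries are much larger.
CUE_WORDS = {
    "self_reference": {"i", "me"},
    "other_reference": {"you", "he"},
    "uncertainty": {"maybe", "perhaps"},
    "negation": {"no", "never"},
}


def extract_cues(text: str) -> Dict[str, float]:
    """Return length measures and word-category proportions for one description."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(words)
    if n == 0:
        raise ValueError("text contains no words")
    cues = {
        "word_count": float(n),                                     # text length
        "words_gt_6_letters": sum(len(w) > 6 for w in words) / n,   # lexical complexity
        "words_per_sentence": n / len(sentences),                   # sentence complexity
    }
    for name, vocab in CUE_WORDS.items():
        cues[name] = sum(w in vocab for w in words) / n             # category proportion
    return cues


sample = "I need help. Maybe you never believed me."
cues = extract_cues(sample)
for name, value in cues.items():
    print(f"{name}: {value:.3f}")
```

A word may belong to more than one category, as "never" does in the table (certainty and negation), so the proportions are not expected to sum to one.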
[Figure: Model Process]
| Model | Accuracy | Recall | Precision | F1 | AUC |
|---|---|---|---|---|---|
| BorderlineSMOTE+AdaBoost | 0.961 | 0.118 | 0.286 | 0.167 | 0.660 |
| BorderlineSMOTE+XGBoost | 0.947 | 0.176 | 0.188 | 0.182 | 0.785 |
| BorderlineSMOTE+Random Forest | 0.924 | 0.412 | 0.194 | 0.264 | 0.812 |
| RUS+AdaBoost | 0.928 | 0.529 | 0.237 | 0.327 | 0.903 |
| RUS+XGBoost | 0.922 | 0.529 | 0.220 | 0.310 | 0.856 |
| RUS+Random Forest | 0.953 | 0.529 | 0.360 | 0.429 | 0.900 |
| Easy Ensemble+AdaBoost | 0.732 | 0.882 | 0.100 | 0.180 | 0.894 |
| Easy Ensemble+XGBoost | 0.713 | 0.882 | 0.094 | 0.169 | 0.889 |
| Easy Ensemble+Random Forest | 0.740 | 0.706 | 0.086 | 0.153 | 0.848 |
Model Performance with Default Classification Threshold
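As a reading aid, the F1 column can be cross-checked against the Recall and Precision columns using the standard definition F1 = 2PR / (P + R). The Python sketch below recomputes F1 for each row of the default-threshold table. Because the published inputs are rounded to three decimals, an exact match is not expected, so a tolerance of 0.002 is assumed.

```python
# (model, recall, precision, reported F1) copied from the default-threshold table.
ROWS = [
    ("BorderlineSMOTE+AdaBoost", 0.118, 0.286, 0.167),
    ("BorderlineSMOTE+XGBoost", 0.176, 0.188, 0.182),
    ("BorderlineSMOTE+Random Forest", 0.412, 0.194, 0.264),
    ("RUS+AdaBoost", 0.529, 0.237, 0.327),
    ("RUS+XGBoost", 0.529, 0.220, 0.310),
    ("RUS+Random Forest", 0.529, 0.360, 0.429),
    ("Easy Ensemble+AdaBoost", 0.882, 0.100, 0.180),
    ("Easy Ensemble+XGBoost", 0.882, 0.094, 0.169),
    ("Easy Ensemble+Random Forest", 0.706, 0.086, 0.153),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest absolute gap between recomputed and reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - f1) for _, r, p, f1 in ROWS)

for model, r, p, f1 in ROWS:
    print(f"{model}: recomputed={f1_score(p, r):.3f}, reported={f1:.3f}")
print(f"max gap: {max_gap:.4f}")
```

The rows also show why accuracy is a poor guide here: the Easy Ensemble models have the lowest accuracy but the highest recall, which is the pattern expected when the fraud class is rare.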
| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| BorderlineSMOTE+AdaBoost | 0.920 | 0.294 | 0.147 | 0.196 |
| BorderlineSMOTE+XGBoost | 0.930 | 0.353 | 0.194 | 0.250 |
| BorderlineSMOTE+Random Forest | 0.949 | 0.353 | 0.286 | 0.316 |
| RUS+AdaBoost | 0.914 | 0.706 | 0.235 | 0.353 |
| RUS+XGBoost | 0.951 | 0.471 | 0.333 | 0.390 |
| RUS+Random Forest | 0.959 | 0.529 | 0.409 | 0.462 |
| Easy Ensemble+AdaBoost | 0.947 | 0.588 | 0.333 | 0.426 |
| Easy Ensemble+XGBoost | 0.973 | 0.353 | 0.667 | 0.462 |
| Easy Ensemble+Random Forest | 0.967 | 0.412 | 0.500 | 0.452 |
Model Performance with Optimal Classification Threshold
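The F1 gains from threshold moving can be worked out directly from the two performance tables. In the Python sketch below, the 0.279 average gain quoted in the abstract is reproduced by the three Easy Ensemble rows, and the 195% maximum corresponds to Easy Ensemble+Random Forest. Reading the 0.279 figure as an Easy Ensemble average is an inference from the arithmetic, not something the page states; the mean over all nine rows comes out lower.

```python
# F1 scores copied from the default-threshold and optimal-threshold tables.
DEFAULT_F1 = {
    "BorderlineSMOTE+AdaBoost": 0.167,
    "BorderlineSMOTE+XGBoost": 0.182,
    "BorderlineSMOTE+Random Forest": 0.264,
    "RUS+AdaBoost": 0.327,
    "RUS+XGBoost": 0.310,
    "RUS+Random Forest": 0.429,
    "Easy Ensemble+AdaBoost": 0.180,
    "Easy Ensemble+XGBoost": 0.169,
    "Easy Ensemble+Random Forest": 0.153,
}
OPTIMAL_F1 = {
    "BorderlineSMOTE+AdaBoost": 0.196,
    "BorderlineSMOTE+XGBoost": 0.250,
    "BorderlineSMOTE+Random Forest": 0.316,
    "RUS+AdaBoost": 0.353,
    "RUS+XGBoost": 0.390,
    "RUS+Random Forest": 0.462,
    "Easy Ensemble+AdaBoost": 0.426,
    "Easy Ensemble+XGBoost": 0.462,
    "Easy Ensemble+Random Forest": 0.452,
}

# Absolute and relative F1 gain per model.
gains = {m: OPTIMAL_F1[m] - DEFAULT_F1[m] for m in DEFAULT_F1}
relative = {m: gains[m] / DEFAULT_F1[m] for m in DEFAULT_F1}

mean_all = sum(gains.values()) / len(gains)
easy = [g for m, g in gains.items() if m.startswith("Easy Ensemble")]
mean_easy = sum(easy) / len(easy)
best_model = max(relative, key=relative.get)

print(f"mean gain, all nine models: {mean_all:.3f}")
print(f"mean gain, Easy Ensemble models: {mean_easy:.3f}")
print(f"largest relative gain: {best_model} ({relative[best_model]:.0%})")
```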
[Figure: Feature Summary]
[Figure: Model Prediction on Sample Numbered 10]
[1] Salido-Andres N, Rey-Garcia M, Alvarez-Gonzalez L I, et al. Mapping the Field of Donation-Based Crowdfunding for Charitable Causes: Systematic Review and Conceptual Framework[J]. VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations, 2021, 32(2): 288-302.
doi: 10.1007/s11266-020-00213-w
[2] Paulus T M, Roberts K R. Crowdfunding a “Real-Life Superhero”: The Construction of Worthy Bodies in Medical Campaign Narratives[J]. Discourse, Context & Media, 2018, 21: 64-72.
[3] Zhu Hao, Yin Keli, Yang-Li Huizi. Effects of Beneficiaries’ Facial Expressions and Donor-Beneficiary Relationship on Donations Towards Online Crowdfunding for Charity[J]. Studies of Psychology and Behavior, 2020, 18(4): 570-576. (in Chinese)
[4] Bassani G, Marinelli N, Vismara S. Crowdfunding in Healthcare[J]. The Journal of Technology Transfer, 2019, 44(4): 1290-1310.
doi: 10.1007/s10961-018-9663-7
[5] Zenone M, Snyder J. Fraud in Medical Crowdfunding: A Typology of Publicized Cases and Policy Recommendations[J]. Policy & Internet, 2019, 11(2): 215-234.
doi: 10.1002/poi3.v11.3
[6] Sumlin B. Reports: Man Fakes Cancer, Raises over $60,000 in Donations[EB/OL]. [2022-06-11]. https://okcfox.com/news/local/reports-man-fakes-cancer-raises-over-60000-in-donations.
[7] China Daily. Malicious Fundraising Cases Accounted for 0.3%: “Waterdrop Walker” Risk Control System Fully Launched[EB/OL]. (2020-11-23). [2022-06-11]. https://cn.chinadaily.com.cn/a/202011/23/WS5fbb3177a3101e7ce973109f.html. (in Chinese)
[8] Lee C H, Bian Y, Karaouzene R, et al. Examining the Role of Narratives in Civic Crowdfunding: Linguistic Style and Message Substance[J]. Industrial Management & Data Systems, 2019, 119(7): 1492-1514.
[9] Vassell A, Crooks V A, Snyder J. What was Lost, Missing, Sought and Hoped for: Qualitatively Exploring Medical Crowdfunding Campaign Narratives for Lyme Disease[J]. Health: An Interdisciplinary Journal for the Social Study of Health, Illness and Medicine, 2021, 25(6): 707-721.
doi: 10.1177/1363459320912808
[10] Xu K B, Wang X Y. “Kindhearted People, Please Save My Family”: Narrative Strategies for New Media Medical Crowdfunding[J]. Health Communication, 2020, 35(13): 1605-1613.
doi: 10.1080/10410236.2019.1654173
[11] Zhao X, Mao Y S. The Identity Lies in the Words of Crowd-Funders: Help-Seekers’ Identity Construction in Chinese Online Medical Crowd-Funding Discourses[J]. Health Communication, 2023, 38(2): 363-370.
doi: 10.1080/10410236.2021.1951959
[12] Bu Yamin, Zhen Weifeng. Text Construction and Expression of Social Media Public Welfare Platform Issues[J]. Youth Journalist, 2020(20): 29-30. (in Chinese)
[13] Parhankangas A, Renko M. Linguistic Style and Crowdfunding Success Among Social and Commercial Entrepreneurs[J]. Journal of Business Venturing, 2017, 32(2): 215-236.
doi: 10.1016/j.jbusvent.2016.11.001
[14] Robiady N D, Windasari N A, Nita A. Customer Engagement in Online Social Crowdfunding: The Influence of Storytelling Technique on Donation Performance[J]. International Journal of Research in Marketing, 2021, 38(2): 492-500.
doi: 10.1016/j.ijresmar.2020.03.001
[15] Buller D B, Burgoon J K. Interpersonal Deception Theory[J]. Communication Theory, 1996, 6(3): 203-242.
doi: 10.1111/comt.1996.6.issue-3
[16] McCornack S A, Morrison K, Paik J E, et al. Information Manipulation Theory 2: A Propositional Theory of Deceptive Discourse Production[J]. Journal of Language and Social Psychology, 2014, 33(4): 348-377.
doi: 10.1177/0261927X14534656
[17] Bond C F, DePaulo B M. Accuracy of Deception Judgments[J]. Personality and Social Psychology Review, 2006, 10(3): 214-234.
pmid: 16859438
[18] Newman M L, Pennebaker J W, Berry D S, et al. Lying Words: Predicting Deception from Linguistic Styles[J]. Personality and Social Psychology Bulletin, 2003, 29(5): 665-675.
pmid: 15272998
[19] Ho S M, Hancock J T, Booth C, et al. Computer-Mediated Deception: Strategies Revealed by Language-Action Cues in Spontaneous Communication[J]. Journal of Management Information Systems, 2016, 33(2): 393-420.
doi: 10.1080/07421222.2016.1205924
[20] Shafqat W, Lee S, Malik S, et al. The Language of Deceivers: Linguistic Features of Crowdfunding Scams[C]// Proceedings of the 25th International Conference Companion on World Wide Web. New York: ACM, 2016: 99-100.
[21] Humpherys S L, Moffitt K C, Burns M B, et al. Identification of Fraudulent Financial Statements Using Linguistic Credibility Analysis[J]. Decision Support Systems, 2011, 50(3): 585-594.
doi: 10.1016/j.dss.2010.08.009
[22] Forsyth L, Anglim J. Using Text Analysis Software to Detect Deception in Written Short-Answer Questions in Employee Selection[J]. International Journal of Selection and Assessment, 2020, 28(3): 236-246.
doi: 10.1111/ijsa.v28.3
[23] Ho S M, Hancock J T. Context in a Bottle: Language-Action Cues in Spontaneous Computer-Mediated Deception[J]. Computers in Human Behavior, 2019, 91: 33-41.
doi: 10.1016/j.chb.2018.09.008
[24] Sepehri A, Markowitz D M, Duclos R. The Location of Maximum Emotion in Deceptive and Truthful Texts[J]. Social Psychological and Personality Science, 2020, 12(6): 996-1004.
doi: 10.1177/1948550620949730
[25] Deng Shasha, Zhang Pengzhu, Zhang Xiaoyan, et al. Deception Detection Based on Fake Linguistic Cues[J]. Journal of Systems & Management, 2014, 23(2): 263-270. (in Chinese)
[26] Vrij A, Fisher R, Mann S, et al. A Cognitive Load Approach to Lie Detection[J]. Journal of Investigative Psychology and Offender Profiling, 2008, 5(1-2): 39-43.
doi: 10.1002/jip.v5:1/2
[27] Vrij A. Verbal Lie Detection Tools: Statement Validity Analysis, Reality Monitoring and Scientific Content Analysis[A]// Granhag P A, Vrij A, Verschuere B. Detecting Deception: Current Challenges and Cognitive Approaches[M]. Wiley-Blackwell, 2015: 3-35.
[28] Vrij A. Criteria-Based Content Analysis: A Qualitative Review of the First 37 Studies[J]. Psychology, Public Policy, and Law, 2005, 11(1): 3-41.
doi: 10.1037/1076-8971.11.1.3
[29] Johnson M K, Raye C L. Reality Monitoring[J]. Psychological Review, 1981, 88(1): 67-85.
doi: 10.1037/0033-295X.88.1.67
[30] Sporer S L. The Less Travelled Road to Truth: Verbal Cues in Deception Detection in Accounts of Fabricated and Self-Experienced Events[J]. Applied Cognitive Psychology, 1997, 11(5): 373-397.
doi: 10.1002/(ISSN)1099-0720
[31] Suchotzki K, Verschuere B, Van Bockstaele B, et al. Lying Takes Time: A Meta-Analysis on Reaction Time Measures of Deception[J]. Psychological Bulletin, 2017, 143(4): 428-453.
doi: 10.1037/bul0000087 pmid: 28182460
[32] Corley P C, Wedeking J. The (Dis)Advantage of Certainty: The Importance of Certainty in Language[J]. Law & Society Review, 2014, 48(1): 35-62.
doi: 10.1111/lasr.2014.48.issue-1
[33] Pezzuti T, Leonhardt J M, Warren C. Certainty in Language Increases Consumer Engagement on Social Media[J]. Journal of Interactive Marketing, 2021, 53: 32-46.
doi: 10.1016/j.intmar.2020.06.005
[34] Hancock J T, Curry L E, Goorha S, et al. On Lying and Being Lied To: A Linguistic Analysis of Deception in Computer-Mediated Communication[J]. Discourse Processes, 2007, 45(1): 1-23.
doi: 10.1080/01638530701739181
[35] Pennebaker J W. The Secret Life of Pronouns[J]. New Scientist, 2011, 211(2828): 42-45.
[36] Hauch V, Blandón-Gitlin I, Masip J, et al. Are Computers Effective Lie Detectors? A Meta-Analysis of Linguistic Cues to Deception[J]. Personality and Social Psychology Review, 2015, 19(4): 307-342.
doi: 10.1177/1088868314556539 pmid: 25387767
[37] Markowitz D M, Griffin D J. When Context Matters: How False, Truthful, and Genre-Related Communication Styles are Revealed in Language[J]. Psychology, Crime & Law, 2020, 26(3): 287-310.
[38] Chua A Y, Banerjee S. Linguistic Predictors of Rumor Veracity on the Internet[C]// Proceedings of the International Multi Conference of Engineers and Computer Scientists. 2016: 387-391.
[39] Petty R E, Schumann D W, Richman S A, et al. Positive Mood and Persuasion: Different Roles for Affect Under High- and Low-Elaboration Conditions[J]. Journal of Personality and Social Psychology, 1993, 64(1): 5-20.
doi: 10.1037/0022-3514.64.1.5
[40] Johnson M K, Bush J G, Mitchell K J. Interpersonal Reality Monitoring: Judging the Sources of Other People’s Memories[J]. Social Cognition, 1998, 16(2): 199-224.
doi: 10.1521/soco.1998.16.2.199
[41] Tausczik Y R, Pennebaker J W. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods[J]. Journal of Language and Social Psychology, 2010, 29(1): 24-54.
doi: 10.1177/0261927X09351676
[42] Braun M T, Van Swol L M. Justifications Offered, Questions Asked, and Linguistic Patterns in Deceptive and Truthful Monetary Interactions[J]. Group Decision and Negotiation, 2016, 25(3): 641-661.
doi: 10.1007/s10726-015-9455-5
[43] Qiu Yunfei, Guo Lei. Predicting Diabetic Complications with Unbalanced Data[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 116-128. (in Chinese)
[44] Lundberg S M, Lee S-I. A Unified Approach to Interpreting Model Predictions[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 4768-4777.