Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (6): 113-122    DOI: 10.11925/infotech.2096-3467.2022.0442
Unbalanced Fake Review Processing Model Based on Cost-Sensitive Learning
Liu Meiling1, Shang Yue1, Zhao Tiejun2, Zhou Jiyun3
1School of Information and Computer Engineering, Northeast Forestry University, Harbin 150006, China
2School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
3Lieber Institute, Johns Hopkins University, Baltimore, MD 21218, USA
Abstract  

[Objective] This study aims to enhance the detection of fake reviews by improving the model’s ability to learn deep semantic information from text and by addressing the problem of data imbalance. [Methods] User behavior and text characteristics of the dataset were analyzed to automatically calculate a cost-sensitive matrix based on inter-class separability, thereby improving the model’s ability to learn from unbalanced data. Additionally, the text encoding ability of BERT was used to optimize the model further. [Results] Extensive experiments on the YelpCHI dataset showed that the proposed model outperformed existing advanced methods, with improvements of 18 percentage points in F1 and 12 percentage points in AUC. [Limitations] Although the proposed method achieved promising results, further research is needed to explore its applicability to other domains. [Conclusions] Using user behavior and text features to calculate class separability effectively improves the model’s performance in detecting fake reviews. Combining the cost-sensitive matrix with BERT’s text encoding ability holds great potential for improving fake review detection.

Key words: Fake Review Detection; Class Separability Computation; Cost-Sensitive Learning; Unbalanced Data Processing
Received: 06 May 2022      Published: 09 August 2023
ZTFLH: TP393; G250
Fund: Natural Science Foundation of Heilongjiang Province (LH2022F002); National Natural Science Foundation of China (61702091)
Corresponding Author: Liu Meiling, ORCID: 0000-0003-4208-7274, E-mail: mlliu@nefu.edu.cn

Cite this article:

Liu Meiling, Shang Yue, Zhao Tiejun, Zhou Jiyun. Unbalanced Fake Review Processing Model Based on Cost-Sensitive Learning. Data Analysis and Knowledge Discovery, 2023, 7(6): 113-122.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0442     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I6/113

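Annotation method | Dataset name | Dataset description | Domain
Rule-based | Amazon product review | 5.8M reviews, 2.15M users | Books, DVDs, music, products
Rule-based | Amazon book review | 6.8K reviews, 4.8K users | Books
Rule-based | TripAdvisor booking review | 2.8K reviews | Hotels
Human-annotated | Epinion | 6K reviews | Products
Human-annotated | TripAdvisor | 3K reviews | Hotels
Algorithmic filtering | Yelp hotel | 67.4K reviews, 38K users | Hotels and restaurants
Algorithmic filtering | Yelp hotel | 3.59M reviews, 16K users | Hotels and restaurants
Algorithmic filtering | Yelp hotel | 6.09M reviews, 9K users | Hotels and restaurants
Algorithmic filtering | Dianping | 10K reviews, 9K users | Restaurants
Amazon crowdsourcing | TripAdvisor Hotel | 1,600 reviews | Hotels
Amazon crowdsourcing | TripAdvisor multidomain | 3,032 reviews | Hotels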
Fake Review Detection Research Dataset Description
The Model Structure of MIANA-B
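User behavior feature | Description
Rating deviation (RD) [14] | Deviation of a single user's rating from the product's average rating
Rating deviation threshold (DEV) [15] | Deviation computed from the normalized RD
Extreme rating (EXT) [15] | Whether the user's ratings are extreme
Maximum number of reviews in a single day (MNR) [16] | Whether the user posted the most reviews in a single day
Positive review ratio (PR) [17] | Probability that the user posts a positive review
Negative review ratio (NR) [17] | Probability that the user posts a negative review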
User Behavior Characteristics
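As a rough illustration of how some of the features in the table above could be derived from raw ratings, here is a minimal Python sketch. It is not the authors' implementation. The function name `behavior_features`, the input layout `(user_id, product_id, rating)`, the 5-star scale, and the thresholds for "extreme", "positive" and "negative" ratings are all assumptions made for this example. DEV and MNR are omitted because they need a deviation threshold and review dates that are not given here.

```python
from collections import defaultdict
from statistics import mean


def behavior_features(reviews):
    """Compute RD, EXT, PR and NR per user.

    reviews: list of (user_id, product_id, rating) tuples, ratings assumed in 1..5.
    """
    by_product = defaultdict(list)
    by_user = defaultdict(list)
    for user, product, rating in reviews:
        by_product[product].append(rating)
        by_user[user].append((product, rating))

    # Average rating of each product over all of its reviews.
    product_avg = {p: mean(rs) for p, rs in by_product.items()}

    features = {}
    for user, items in by_user.items():
        n = len(items)
        features[user] = {
            # RD: mean absolute gap between the user's rating and the product average.
            "RD": mean(abs(r - product_avg[p]) for p, r in items),
            # EXT (assumed definition): share of ratings at either end of the scale.
            "EXT": sum(r in (1, 5) for _, r in items) / n,
            # PR / NR (assumed thresholds): 4-5 stars positive, 1-2 stars negative.
            "PR": sum(r >= 4 for _, r in items) / n,
            "NR": sum(r <= 2 for _, r in items) / n,
        }
    return features


sample = [("u1", "p1", 5), ("u2", "p1", 1), ("u1", "p2", 4), ("u3", "p2", 2)]
feats = behavior_features(sample)
```

In this sketch each user ends up with one small feature dictionary, which could then be concatenated with text features into a vector like the feature sets shown in the next table.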
Review text | Feature set
I went here on Friday night during Restaurant Week. The lamb meat was excellent. The fillet mignon wasn’t as good as I had hoped. I loved their pao de queijo. However, I was disappointed when the server brought the banana cream pie. It was too sweet, and we didn’t even get a chance to choose our dessert. | [-1.0027, 0.0, 0.0, 1.0, 0.4, 0.0, 11.0, 0.0, 67.0, 4.0, 4.0]
My girlfriend and I reserve this as a “special” dinner place for certain occasions. Tremendous neopolitan pizza, great insalata mista, and easily the best limoncello anywhere. It’s tough to get a table at prime time on a weekend - go there on a weeknight. One of the fun things about going here for is how relaxed and happy all the patrons seem. The owner is very nice, too, and will visit your table if things aren’t too busy. | [1.0358, 0.0, 1.0, 1.0, 1.0, 0.0, 6.0, 0.0, 91.0, 1.0, 2.0]
Descriptive Statistics of User Behavior Characteristics and Review Text Characteristics
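Item | Genuine reviews (YelpCHI) | Fake reviews (YelpCHI)
Number of reviews | 58,476 | 8,919
Share of data | 86.70% | 13.20%
Average length | 170 | 120
Share of positive reviews | 74.29% | 70.34%
Share of negative reviews | 25.71% | 29.66%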
Statistics Description of YelpCHI Dataset
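To make the imbalance in these statistics concrete, the following Python sketch derives per-class costs from the review counts and plugs them into a cost-weighted cross-entropy. This is only an illustration of the general cost-sensitive idea. The inverse-frequency costs used here are a common stand-in and are an assumption of this example; they are not the separability-based cost matrix described in the abstract. The names `inverse_frequency_costs` and `weighted_cross_entropy` are hypothetical.

```python
import math

# Class counts taken from the YelpCHI statistics (index 0 = genuine, 1 = fake).
N_GENUINE = 58_476
N_FAKE = 8_919


def inverse_frequency_costs(counts):
    """One cost per class, proportional to 1 / class frequency.

    Scaled so that the count-weighted average cost equals 1.
    """
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]


def weighted_cross_entropy(probs, labels, costs):
    """Mean over samples of cost[true class] * -log p(true class).

    probs: list of per-sample class-probability lists; labels: true class indices.
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        loss += -costs[y] * math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
    return loss / len(labels)


costs = inverse_frequency_costs([N_GENUINE, N_FAKE])
imbalance_ratio = N_GENUINE / N_FAKE  # genuine reviews per fake review

# The same confident mistake is penalised more when the true class is "fake".
loss_missed_fake = weighted_cross_entropy([[0.9, 0.1]], [1], costs)
loss_missed_genuine = weighted_cross_entropy([[0.1, 0.9]], [0], costs)
```

With these counts there are roughly 6.6 genuine reviews for every fake one, so under this heuristic a misclassified fake review weighs about 6.6 times as much as a misclassified genuine one.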
Model | F1 (%) | AUC (%)
Player2Vec | 46.08 | 54.03
GraphConsis | 57.76 | 74.28
En-HGAN | 57.82 | 75.16
MIANA | 64.78 | 86.65
MIANA-BERT | 65.03 | 86.68
MIANA-C | 68.83 | 86.37
MIANA+Focal Loss | 67.83 | 86.04
MIANA-B (ours) | 75.97 | 87.34
MIANA-B Experimental Results
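One of the baselines in the results table, MIANA+Focal Loss, relies on the focal loss of reference [19]. For readers unfamiliar with it, a minimal Python sketch of the binary form follows. The function name `binary_focal_loss` is hypothetical, and the defaults `gamma=2.0` and `alpha=0.25` are the values reported in the focal loss paper; the settings actually used for the baseline are not stated here, so treat them as assumptions.

```python
import math


def binary_focal_loss(p_fake, label, gamma=2.0, alpha=0.25):
    """Focal loss for a single sample: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p_fake: predicted probability of the positive (fake) class.
    label:  1 for fake, 0 for genuine.
    """
    # p_t is the probability the model assigned to the true class.
    p_t = p_fake if label == 1 else 1.0 - p_fake
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # fake review predicted confidently
hard = binary_focal_loss(0.1, 1)  # fake review the model almost missed
```

The modulating factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified samples, so training concentrates on hard ones; with `gamma=0` the expression reduces to an alpha-weighted cross-entropy.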
Precision of MIANA-B and Comparison Models for Two Categories
Recall of MIANA-B and Comparison Models for Two Categories
F1-Score of MIANA-B and Comparison Models for Two Categories
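The per-class precision, recall and F1 named in these captions, and the AUC reported in the results table, can be computed as in the plain-Python sketch below. It is meant only to pin down the definitions. The function names and toy labels are hypothetical, and whether the paper's headline F1 is macro-averaged or computed for the fake class alone is not stated here.

```python
def per_class_prf(y_true, y_pred, cls):
    """Precision, recall and F1 when `cls` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = fake, 0 = genuine.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]  # predicted probability of "fake"

fake_prf = per_class_prf(y_true, y_pred, cls=1)
genuine_prf = per_class_prf(y_true, y_pred, cls=0)
auc = roc_auc(y_true, scores)
```

Reporting the metrics per class, as the figures do, matters on unbalanced data because a model can score well on the majority (genuine) class while missing most fake reviews.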
[1] Bajaj S, Garg N, Singh S K. A Novel User-Based Spam Review Detection[J]. Procedia Computer Science, 2017, 122: 1009-1015.
doi: 10.1016/j.procs.2017.11.467
[2] Felbermayr A, Nanopoulos A. The Role of Emotions for the Perceived Usefulness in Online Customer Reviews[J]. Journal of Interactive Marketing, 2016, 36: 60-76.
doi: 10.1016/j.intmar.2016.05.004
[3] Jindal N, Liu B. Review Spam Detection[C]// Proceedings of the 16th International Conference on World Wide Web. New York: ACM, 2007: 1189-1190.
[4] Liu M L, Shang Y, Yue Q, et al. Detecting Fake Reviews Using Multidimensional Representations with Fine-Grained Aspects Plan[J]. IEEE Access, 2020, 9: 3765-3773.
doi: 10.1109/ACCESS.2020.3047947
[5] Zhou Liyu. Research on Review Spam Detection Based on Imbalanced Data Classification Method[D]. Hefei: Hefei University of Technology, 2018. (in Chinese)
[6] Ott M, Choi Y, Cardie C, et al. Finding Deceptive Opinion Spam by Any Stretch of the Imagination[OL]. arXiv Preprint, arXiv:1107.4557.
[7] Fusilier D H, Montes-y-Gómez M, Rosso P, et al. Detecting Positive and Negative Deceptive Opinions Using PU-Learning[J]. Information Processing and Management, 2015, 51(4): 433-443.
doi: 10.1016/j.ipm.2014.11.001
[8] Li Y J, Wang F X, Zhang S W, et al. Detection of Fake Reviews Using Group Model[J]. Mobile Networks and Applications, 2021, 26(1): 91-103.
doi: 10.1007/s11036-020-01688-z
[9] Wang N, Yang J, Kong X F, et al. A Fake Review Identification Framework Considering the Suspicion Degree of Reviews with Time Burst Characteristics[J]. Expert Systems with Applications, 2022, 190: 116207.
doi: 10.1016/j.eswa.2021.116207
[10] Liu Z W, Dou Y T, Yu P S, et al. Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection[C]// Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2020: 1569-1572.
[11] Zhang Y M, Fan Y J, Ye Y F, et al. Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework[C]// Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: ACM, 2019: 549-558.
[12] Zhao Min, Zhang Yueqin, Dou Yingtong, et al. Imbalanced Fake Reviews Detection with Ensemble Hierarchical Graph Attention Network[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(2): 428-441. (in Chinese)
doi: 10.3778/j.issn.1673-9418.2104090
[13] Wan Jianwu, Yang Ming. Survey on Cost-Sensitive Learning Method[J]. Journal of Software, 2020, 31(1): 113-136. (in Chinese)
[14] Chen Y R, Chen H H. Opinion Spam Detection in Web Forum: A Real Case Study[C]// Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015: 173-183.
[15] Mukherjee A, Kumar A, Liu B, et al. Spotting Opinion Spammers Using Behavioral Footprints[C]// Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 632-640.
[16] Mukherjee A, Liu B, Glance N. Spotting Fake Reviewer Groups in Consumer Reviews[C]// Proceedings of the 21st International Conference on World Wide Web. New York: ACM, 2012: 191-200.
[17] Mukherjee A, Venkataraman V, Liu B, et al. What Yelp Fake Review Filter Might be Doing?[C]// Proceedings of the 7th International AAAI Conference on Web and Social Media. 2013.
[18] Rayana S, Akoglu L. Collective Opinion Spam Detection: Bridging Review Networks and Metadata[C]// Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2015: 985-994.
[19] Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. IEEE, 2017: 2999-3007.