Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (6): 113-122     https://doi.org/10.11925/infotech.2096-3467.2022.0442
Research Paper
Unbalanced Fake Review Processing Model Based on Cost-Sensitive Learning*
Liu Meiling1, Shang Yue1, Zhao Tiejun2, Zhou Jiyun3
1School of Information and Computer Engineering, Northeast Forestry University, Harbin 150006, China
2School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
3Lieber Institute, Johns Hopkins University, Baltimore, MD 21218, USA
Abstract

[Objective] This study aims to strengthen a fake review detection model's ability to learn deep semantic information from text and to address the severe data imbalance in this task. [Methods] The model computes inter-class separability from the user behavior features and text features of the data and uses it to learn a cost-sensitive matrix automatically, which improves learning from unbalanced data. BERT's text encoding ability is used to optimize the model further. [Results] In experiments on the YelpCHI dataset, the proposed model improved on the existing advanced method En-HGAN by about 18 percentage points in F1 and about 12 percentage points in AUC. [Limitations] The proposed model has not yet been applied to other research domains. [Conclusions] Treating user behavior features and review text features as the feature set that separates the fake and genuine classes, and computing class separability over that set, effectively improves fake review detection. The combination of the cost-sensitive matrix with BERT's text encoding shows promise for this task.

Keywords: Fake Review Detection; Class Separability Computation; Cost-Sensitive Learning; Unbalanced Data Processing
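The core idea of cost-sensitive learning named in the abstract can be illustrated with a cost-weighted cross-entropy. The sketch below is not the authors' formulation: this page does not give the cost-sensitive matrix or how it is derived from class separability, so the `class_cost` values, the toy probabilities, and the function name are all assumptions chosen for illustration.

```python
import math

def cost_sensitive_cross_entropy(probs, labels, class_cost):
    """Mean cross-entropy in which each sample's loss is scaled by the
    cost assigned to misclassifying its true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp the probability to avoid log(0).
        total += -class_cost[y] * math.log(max(p[y], 1e-12))
    return total / len(labels)

# Assumed costs (hypothetical): an error on the fake class (1) costs
# five times as much as an error on the genuine class (0).
class_cost = {0: 1.0, 1: 5.0}

# Toy predicted probabilities [P(genuine), P(fake)] and true labels.
probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
labels = [0, 1, 1]

weighted_loss = cost_sensitive_cross_entropy(probs, labels, class_cost)
plain_loss = cost_sensitive_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
print(weighted_loss > plain_loss)
```

With uniform costs the function reduces to ordinary cross-entropy. Raising the cost of the minority class makes mistakes on fake reviews dominate the loss, which is the mechanism that cost-sensitive methods rely on for unbalanced data.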
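Received: 2022-05-06      Published: 2023-08-09
CLC number (ZTFLH): TP393; G250
Funding: * Natural Science Foundation of Heilongjiang Province (LH2022F002); National Natural Science Foundation of China, Young Scientists Fund (61702091)
Corresponding author: Liu Meiling, ORCID: 0000-0003-4208-7274, E-mail: mlliu@nefu.edu.cn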
Cite this article:
Liu Meiling, Shang Yue, Zhao Tiejun, Zhou Jiyun. Unbalanced Fake Review Processing Model Based on Cost-Sensitive Learning. Data Analysis and Knowledge Discovery, 2023, 7(6): 113-122.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0442      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I6/113
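| Labeling method | Dataset | Description | Domain |
|---|---|---|---|
| Rule-based | Amazon product review | 5.8M reviews, 2.15M users | Books, DVDs, music, products |
| Rule-based | Amazon book review | 6.8K reviews, 4.8K users | Books |
| Rule-based | TripAdvisor booking review | 2.8K reviews | Hotels |
| Manual annotation | Epinion | 6K reviews | Products |
| Manual annotation | TripAdvisor | 3K reviews | Hotels |
| Algorithmic filtering | Yelp hotel | 67.4K reviews, 38K users | Hotels and restaurants |
| Algorithmic filtering | Yelp hotel | 3.59M reviews, 16K users | Hotels and restaurants |
| Algorithmic filtering | Yelp hotel | 6.09M reviews, 9K users | Hotels and restaurants |
| Algorithmic filtering | Dianping | 10K reviews, 9K users | Restaurants |
| Amazon crowdsourcing | TripAdvisor Hotel | 1,600 reviews | Hotels |
| Amazon crowdsourcing | TripAdvisor multidomain | 3,032 reviews | Hotels |

Table 1  Public datasets for fake review research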
Fig. 1  Structure of the MIANA-B model
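| User behavior feature | Description |
|---|---|
| Rating deviation (RD) [14] | Deviation of a single user's rating from the product's average rating |
| Rating deviation threshold (DEV) [15] | Normalized deviation computed from RD |
| Extreme rating (EXT) [15] | Whether the user's rating is extreme |
| Maximum number of reviews in a day (MNR) [16] | Whether the user posted the most reviews on a single day |
| Positive ratio (PR) [17] | Probability that the user posts a positive review |
| Negative ratio (NR) [17] | Probability that the user posts a negative review |

Table 2  User behavior features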
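Several of the features in Table 2 can be computed directly from (user, product, rating) records. The following is a minimal sketch on hypothetical toy data. The exact definitions used in the paper are not given on this page, so the thresholds here are assumptions: an extreme rating is taken to be 1 or 5 stars, a positive review 4 stars or more, and a negative review 2 stars or fewer.

```python
from collections import defaultdict

# Hypothetical toy records: (user_id, product_id, star rating on a 1-5 scale).
reviews = [
    ("u1", "p1", 5), ("u1", "p2", 5), ("u1", "p3", 1),
    ("u2", "p1", 3), ("u2", "p2", 4),
    ("u3", "p1", 4),
]

def product_averages(rows):
    """Average star rating per product."""
    by_product = defaultdict(list)
    for _, product, stars in rows:
        by_product[product].append(stars)
    return {p: sum(v) / len(v) for p, v in by_product.items()}

def rating_deviation(stars, product_avg):
    """RD: absolute gap between one rating and the product's average rating."""
    return abs(stars - product_avg)

def is_extreme(stars):
    """EXT: assumed here to mean a 1-star or 5-star rating."""
    return stars in (1, 5)

def positive_negative_ratio(rows, user):
    """PR and NR under the assumed thresholds (>= 4 positive, <= 2 negative)."""
    stars = [s for u, _, s in rows if u == user]
    pr = sum(s >= 4 for s in stars) / len(stars)
    nr = sum(s <= 2 for s in stars) / len(stars)
    return pr, nr

averages = product_averages(reviews)
rd_u1_p1 = rating_deviation(5, averages["p1"])  # p1 average is (5 + 3 + 4) / 3
pr_u1, nr_u1 = positive_negative_ratio(reviews, "u1")
print(rd_u1_p1, pr_u1, nr_u1)
```

Each user's values can then be concatenated into a behavior vector and passed to the classifier alongside the text representation.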
| Review text | Feature set |
|---|---|
| I went here on Friday night during Restaurant Week. The lamb meat was excellent. The fillet mignon wasn’t as good as I had hoped. I loved their pao de queijo. However, I was disappointed when the server brought the banana cream pie. It was too sweet, and we didn’t even get a chance to choose our dessert. | [-1.0027, 0.0, 0.0, 1.0, 0.4, 0.0, 11.0, 0.0, 67.0, 4.0, 4.0] |
| My girlfriend and I reserve this as a “special” dinner place for certain occasions. Tremendous neopolitan pizza, great insalata mista, and easily the best limoncello anywhere. It’s tough to get a table at prime time on a weekend - go there on a weeknight. One of the fun things about going here for is how relaxed and happy all the patrons seem. The owner is very nice, too, and will visit your table if things aren’t too busy. | [1.0358, 0.0, 1.0, 1.0, 1.0, 0.0, 6.0, 0.0, 91.0, 1.0, 2.0] |

Table 3  Descriptive statistics of user behavior features and review text features
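| Item | YelpCHI: genuine reviews | YelpCHI: fake reviews |
|---|---|---|
| Number of reviews | 58,476 | 8,919 |
| Share of data | 86.70% | 13.20% |
| Average length | 170 | 120 |
| Share of positive reviews | 74.29% | 70.34% |
| Share of negative reviews | 25.71% | 29.66% |

Table 4  Statistics of the YelpCHI dataset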
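The review counts in Table 4 show how skewed the data is. As a rough illustration, the snippet below derives the imbalance ratio and a set of inverse-frequency class weights from those counts. The weighting rule, total / (number of classes × class count), is the common "balanced" heuristic and is an assumption made here for illustration. It is not the separability-based cost matrix that the paper learns.

```python
# Review counts taken from Table 4 (YelpCHI).
counts = {"genuine": 58476, "fake": 8919}

total = sum(counts.values())
imbalance_ratio = counts["genuine"] / counts["fake"]

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
# Assumed heuristic for illustration, not the paper's learned costs.
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

print(f"total reviews: {total}")
print(f"genuine reviews per fake review: {imbalance_ratio:.2f}")
for name, w in weights.items():
    print(f"weight[{name}] = {w:.3f}")
```

There are roughly 6.6 genuine reviews for every fake one, so under this heuristic the fake class receives a weight about 6.6 times that of the genuine class.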
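| Model | F1/% | AUC/% |
|---|---|---|
| Player2Vec | 46.08 | 54.03 |
| GraphConsis | 57.76 | 74.28 |
| En-HGAN | 57.82 | 75.16 |
| MIANA | 64.78 | 86.65 |
| MIANA-BERT | 65.03 | 86.68 |
| MIANA-C | 68.83 | 86.37 |
| MIANA+Focal Loss | 67.83 | 86.04 |
| MIANA-B (ours) | 75.97 | 87.34 |

Table 5  Experimental results of MIANA-B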
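The gains over En-HGAN reported in the abstract can be checked against Table 5. The short calculation below uses only the two rows from that table and separates the absolute gain in percentage points from the relative gain in percent, since the two are easy to confuse. The variable names are illustrative.

```python
# (F1 %, AUC %) taken from Table 5.
results = {
    "En-HGAN": (57.82, 75.16),
    "MIANA-B": (75.97, 87.34),
}

base_f1, base_auc = results["En-HGAN"]
ours_f1, ours_auc = results["MIANA-B"]

# Absolute gains, in percentage points.
f1_gain_pp = round(ours_f1 - base_f1, 2)
auc_gain_pp = round(ours_auc - base_auc, 2)

# Relative gains, in percent of the baseline score.
f1_gain_rel = round(100 * (ours_f1 - base_f1) / base_f1, 1)
auc_gain_rel = round(100 * (ours_auc - base_auc) / base_auc, 1)

print(f"F1:  +{f1_gain_pp} points ({f1_gain_rel}% relative)")
print(f"AUC: +{auc_gain_pp} points ({auc_gain_rel}% relative)")
```

The absolute differences are 18.15 points in F1 and 12.18 points in AUC, which matches the "about 18" and "about 12" percentage points in the abstract.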
Fig. 2  Per-class precision of MIANA-B and the comparison models
Fig. 3  Per-class recall of MIANA-B and the comparison models
Fig. 4  Per-class F1 of MIANA-B and the comparison models
[1] Bajaj S, Garg N, Singh S K. A Novel User-Based Spam Review Detection[J]. Procedia Computer Science, 2017, 122: 1009-1015.
doi: 10.1016/j.procs.2017.11.467
[2] Felbermayr A, Nanopoulos A. The Role of Emotions for the Perceived Usefulness in Online Customer Reviews[J]. Journal of Interactive Marketing, 2016, 36: 60-76.
doi: 10.1016/j.intmar.2016.05.004
[3] Jindal N, Liu B. Review Spam Detection[C]// Proceedings of the 16th International Conference on World Wide Web. New York: ACM, 2007: 1189-1190.
[4] Liu M L, Shang Y, Yue Q, et al. Detecting Fake Reviews Using Multidimensional Representations with Fine-Grained Aspects Plan[J]. IEEE Access, 2020, 9: 3765-3773.
doi: 10.1109/ACCESS.2020.3047947
[5] Zhou Liyu. Research on Review Spam Detection Based on Imbalanced Data Classification Method[D]. Hefei: Hefei University of Technology, 2018. (in Chinese)
[6] Ott M, Choi Y, Cardie C, et al. Finding Deceptive Opinion Spam by Any Stretch of the Imagination[OL]. arXiv Preprint, arXiv: 1107.4557.
[7] Fusilier D H, Montes-y-Gómez M, Rosso P, et al. Detecting Positive and Negative Deceptive Opinions Using PU-Learning[J]. Information Processing and Management, 2015, 51(4): 433-443.
doi: 10.1016/j.ipm.2014.11.001
[8] Li Y J, Wang F X, Zhang S W, et al. Detection of Fake Reviews Using Group Model[J]. Mobile Networks and Applications, 2021, 26(1): 91-103.
doi: 10.1007/s11036-020-01688-z
[9] Wang N, Yang J, Kong X F, et al. A Fake Review Identification Framework Considering the Suspicion Degree of Reviews with Time Burst Characteristics[J]. Expert Systems with Applications, 2022, 190: 116207.
doi: 10.1016/j.eswa.2021.116207
[10] Liu Z W, Dou Y T, Yu P S, et al. Alleviating the Inconsistency Problem of Applying Graph Neural Network to Fraud Detection[C]// Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2020: 1569-1572.
[11] Zhang Y M, Fan Y J, Ye Y F, et al. Key Player Identification in Underground Forums over Attributed Heterogeneous Information Network Embedding Framework[C]// Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York: ACM, 2019: 549-558.
[12] Zhao Min, Zhang Yueqin, Dou Yingtong, et al. Imbalanced Fake Reviews Detection with Ensemble Hierarchical Graph Attention Network[J]. Journal of Frontiers of Computer Science and Technology, 2023, 17(2): 428-441. (in Chinese)
doi: 10.3778/j.issn.1673-9418.2104090
[13] Wan Jianwu, Yang Ming. Survey on Cost-Sensitive Learning Method[J]. Journal of Software, 2020, 31(1): 113-136. (in Chinese)
[14] Chen Y R, Chen H H. Opinion Spam Detection in Web Forum: A Real Case Study[C]// Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015: 173-183.
[15] Mukherjee A, Kumar A, Liu B, et al. Spotting Opinion Spammers Using Behavioral Footprints[C]// Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 632-640.
[16] Mukherjee A, Liu B, Glance N. Spotting Fake Reviewer Groups in Consumer Reviews[C]// Proceedings of the 21st International Conference on World Wide Web. New York: ACM, 2012: 191-200.
[17] Mukherjee A, Venkataraman V, Liu B, et al. What Yelp Fake Review Filter Might be Doing?[C]// Proceedings of the 7th International AAAI Conference on Web and Social Media. 2013.
[18] Rayana S, Akoglu L. Collective Opinion Spam Detection: Bridging Review Networks and Metadata[C]// Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2015: 985-994.
[19] Lin T Y, Goyal P, Girshick R, et al. Focal Loss for Dense Object Detection[C]// Proceedings of the 2017 IEEE International Conference on Computer Vision. IEEE, 2017: 2999-3007.