Support for Cross-Domain Methods of Identifying Fake Comments of Chinese<sup>*</sup>

doi:10.11925/infotech.2096-3467.2022.1347

Data Analysis and Knowledge Discovery (数据分析与知识发现)

2024, Vol. 8, Issue 2: 84-98 https://doi.org/10.11925/infotech.2096-3467.2022.1347

Research Paper


Gu Yan (谷岩)¹, Zheng Kaihong (郑楷洪)¹, Hu Yongjun (胡勇军)¹ (corresponding author), Song Yishan (宋益善)², Liu Dongping (刘东屏)³

¹School of Management, Guangzhou University, Guangzhou 510006, China
²School of Data Science, The Chinese University of Hong Kong, Shenzhen 518000, China
³Partner & Business Enabling, Amazon Web Services GCR, Beijing 100015, China

Abstract

[Objective] Building on a multi-domain dataset, this paper constructs CFEE, a cross-domain Chinese fake review identification model that extracts deep word-relation semantic information from review texts. It addresses two issues that traditional identification methods rarely consider: the differences among data from different domains in Chinese review texts, and the concealment of fake reviews within a domain. [Methods] We propose 11 rules for building fake review datasets and use them to build a multi-domain dataset. We then construct the CFEE model to identify Chinese fake reviews across domains. The model extracts deep semantic information from the text with the ERNIE pre-trained model, identifies the concealment of reviews from the sentiment attributes of the review text, projects the text information onto the word-relation dimension with a convolutional neural network, and fuses the features with a neural network for classification. [Results] On the multi-domain Chinese fake review dataset, the CFEE model reached an F₁ value of 91.52%. On the single-domain datasets for mobile phones, food, clothing, and household appliances, its F₁ values were 85.71%, 79.59%, 85.71%, and 85.00%, respectively, all significantly better than those of existing models. [Limitations] The manual annotation is subjective. [Conclusions] The proposed method can effectively identify Chinese fake reviews across domains.

Keywords: Fake Comments, ERNIE Model, Cross-Domain Identification, Chinese Semantic, Emotional Score
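The four components named in the abstract (deep semantic features, a sentiment-based signal, convolutional word-relation features, and a fusion network for classification) can be pictured with a toy sketch. Everything below is an assumption made for illustration: the vectors are hand-written stand-ins rather than real ERNIE or convolutional outputs, the lexicon-based `sentiment_score` and the weights are hypothetical, and a single sigmoid unit stands in for the fusion layers. It is not the authors' implementation.

```python
import math

# Hypothetical sentiment lexicon (assumption); the paper's own scoring
# rules are not reproduced on this page.
POSITIVE = {"好", "满意", "喜欢"}   # good, satisfied, like
NEGATIVE = {"差", "失望", "退货"}   # bad, disappointed, return the goods


def sentiment_score(tokens):
    """Toy score in [-1, 1]: (positive hits - negative hits) / token count."""
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def fuse(semantic_vec, conv_vec, sentiment):
    """Concatenate the three feature groups into one feature vector."""
    return list(semantic_vec) + list(conv_vec) + [sentiment]


def classify(features, weights, bias):
    """Single sigmoid unit standing in for the fusion network; returns P(fake)."""
    z = sum(w * x for w, x in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


tokens = ["手机", "很", "好", "非常", "满意"]   # "the phone is good, very satisfied"
semantic_vec = [0.2, -0.1, 0.4]   # stand-in for a pre-trained sentence embedding
conv_vec = [0.3, 0.0]             # stand-in for convolutional word-relation features
features = fuse(semantic_vec, conv_vec, sentiment_score(tokens))
weights = [0.5, -0.3, 0.8, 0.1, 0.2, 1.5]   # hypothetical, untrained weights
prob_fake = classify(features, weights, bias=-0.5)
print(f"P(fake) = {prob_fake:.3f}")
```

In a real system the weights would be learned from labeled reviews; the sketch only shows how heterogeneous features can be concatenated before a classification layer.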

Received: 2022-12-21; Published: 2024-01-08

CLC number: G252

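Funding: *National Social Science Fund of China (18BGL236); National Key Research and Development Program of China (2021YFB3301801); Ministry of Education Second-Phase Supply-Demand Matching Employment and Education Project, Key-Area University-Enterprise Cooperation Project (20230103480)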

Corresponding author: Hu Yongjun (胡勇军), ORCID: 0000-0002-9395-7535, E-mail: hyjsdu96@126.com.

Cite this article:

Gu Yan, Zheng Kaihong, Hu Yongjun, Song Yishan, Liu Dongping. Support for Cross-Domain Methods of Identifying Fake Comments of Chinese[J]. Data Analysis and Knowledge Discovery, 2024, 8(2): 84-98.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1347 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I2/84

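Table 1 Identification content for fake reviews

Fig. 1 Semantic feature extraction process for Chinese text

Fig. 2 Different masking strategies of the BERT and ERNIE models

Fig. 3 TextCNN flowchart

Fig. 4 Sentiment score rules

Fig. 5 Neural network layers for feature fusion and classification

Fig. 6 Framework of the CFEE cross-domain identification method

Table 2 Proportions of genuine reviews, fake reviews, training set, test set, and validation set

Table 3 Top 10 word frequencies in fake reviews

Table 4 Top 10 word frequencies in genuine reviews

Table 5 Text similarity

Table 6 Model parameters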

Model	Precision	Recall	F₁ value
CFEE	91.98%	91.07%	91.52%

Table 7 Results of the CFEE model
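As a consistency check on Table 7, the F₁ value can be recomputed from the other two columns. The worked step below assumes that the first metric column is precision (P), the second is recall (R), and that F₁ is the usual harmonic mean of the two; the page itself does not state the formula.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9198 \times 0.9107}{0.9198 + 0.9107}
    \approx 0.9152
```

Under these assumptions the result agrees with the reported 91.52%.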

Model	Precision	Recall	F₁ value
Group 1	90.35%	86.77%	88.52%
Group 2	92.17%	87.68%	89.87%
Group 3	90.80%	90.65%	90.73%
CFEE	91.98%	91.07%	91.52%

Table 8 Ablation experiment results

Model	Precision	Recall	F₁ value
TF-IDF-KNN	73.50%	81.89%	77.49%
TF-IDF-Decision Tree	83.96%	51.53%	63.86%
TF-IDF-Naive Bayes	96.99%	37.30%	53.88%
TF-IDF-Logistic	88.43%	87.26%	87.84%
BERT-LSTM	90.31%	84.78%	87.46%
BERT-BiLSTM	91.26%	83.79%	87.37%
CFEE	91.98%	91.07%	91.52%

Table 9 Experimental results of different methods

Dataset	Model	Precision	Recall	F₁ value
Spam SMS	TF-IDF-KNN	96.70%	60.98%	74.79%
	TF-IDF-Decision Tree	74.22%	98.57%	84.68%
	TF-IDF-Naive Bayes	93.72%	92.68%	93.20%
	TF-IDF-Logistic	96.20%	96.66%	94.43%
	BERT-LSTM	97.14%	97.29%	97.22%
	BERT-BiLSTM	97.46%	97.93%	97.70%
	CFEE	97.93%	97.93%	97.93%
Weibo rumors	TF-IDF-KNN	75.23%	54.58%	63.26%
	TF-IDF-Decision Tree	77.50%	20.26%	32.12%
	TF-IDF-Naive Bayes	90.91%	65.36%	76.05%
	TF-IDF-Logistic	84.23%	82.03%	83.12%
	BERT-LSTM	85.76%	92.18%	88.85%
	BERT-BiLSTM	76.90%	79.15%	78.01%
	CFEE	85.76%	92.18%	88.85%

Table 10 Experimental results of different methods on the spam SMS and Weibo rumor datasets

Table 11 Top 10 word frequency rankings across domain datasets

Domain	Model	Precision	Recall	F₁ value
Mobile phones	TF-IDF-KNN	52.52%	68.22%	59.35%
	TF-IDF-Decision Tree	74.03%	53.27%	61.95%
	TF-IDF-Naive Bayes	57.14%	12.15%	20.04%
	TF-IDF-Logistic	70.21%	61.68%	65.67%
	BERT	70.59%	67.29%	68.90%
	BERT-LSTM	75.00%	64.49%	69.35%
	BERT-BiLSTM	76.14%	62.62%	68.72%
	ERNIE	81.19%	76.64%	78.85%
	CFEE	84.55%	86.92%	85.71%
Food	TF-IDF-KNN	60.36%	66.04%	57.14%
	TF-IDF-Decision Tree	50.00%	4.72%	8.62%
	TF-IDF-Naive Bayes	24.14%	6.60%	10.37%
	TF-IDF-Logistic	68.97%	37.74%	48.79%
	BERT	66.67%	60.38%	63.37%
	BERT-LSTM	74.68%	55.66%	63.78%
	BERT-BiLSTM	72.50%	54.72%	62.37%
	ERNIE	83.87%	73.58%	78.39%
	CFEE	86.67%	73.58%	79.59%
Household appliances	TF-IDF-KNN	48.51%	67.71%	56.52%
	TF-IDF-Decision Tree	65.08%	42.71%	51.57%
	TF-IDF-Naive Bayes	41.18%	7.29%	12.39%
	TF-IDF-Logistic	64.21%	63.51%	63.86%
	BERT	64.20%	54.17%	58.76%
	BERT-LSTM	68.18%	62.50%	65.22%
	BERT-BiLSTM	73.42%	60.42%	66.89%
	ERNIE	71.91%	66.67%	69.19%
	CFEE	87.70%	84.38%	85.71%
Clothing	TF-IDF-KNN	23.97%	43.94%	31.02%
	TF-IDF-Decision Tree	33.33%	9.09%	14.28%
	TF-IDF-Naive Bayes	0%	0%	-
	TF-IDF-Logistic	44.30%	53.03%	48.27%
	BERT	38.14%	56.06%	45.40%
	BERT-LSTM	44.57%	62.12%	51.06%
	BERT-BiLSTM	52.63%	45.45%	48.78%
	ERNIE	69.09%	57.58%	62.81%
	CFEE	94.44%	77.27%	85.00%

Table 12 Experimental results on datasets from different domains
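The per-domain rows above include one case (TF-IDF-Naive Bayes on clothing) with 0% in both metric columns and "-" for F₁, which is consistent with F₁ being undefined when both inputs are zero. The following is a minimal sketch of how such per-domain metrics could be computed from labeled predictions. It assumes label 1 means "fake", treats a metric with an empty denominator as 0.0 by convention, and uses made-up records; the function names are hypothetical and none of this is taken from the paper.

```python
from collections import defaultdict


def prf(y_true, y_pred, positive=1):
    """Return (precision, recall, f1); f1 is None when it is undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Convention (assumption): report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, None   # shown as "-" in a results table
    return precision, recall, 2 * precision * recall / (precision + recall)


def per_domain(records):
    """records: iterable of (domain, true_label, predicted_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for domain, t, p in records:
        grouped[domain][0].append(t)
        grouped[domain][1].append(p)
    return {d: prf(ts, ps) for d, (ts, ps) in grouped.items()}


# Made-up records for illustration only.
records = [
    ("phone", 1, 1), ("phone", 1, 0), ("phone", 0, 1), ("phone", 0, 0),
    ("clothing", 1, 0), ("clothing", 0, 0),
]
results = per_domain(records)
for domain, (p, r, f1) in results.items():
    f1_text = "-" if f1 is None else f"{f1:.2%}"
    print(f"{domain}: precision={p:.2%} recall={r:.2%} F1={f1_text}")
```

Returning `None` instead of a number keeps the undefined case explicit, so a report can print "-" rather than a misleading 0%.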
