Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (2): 84-98    DOI: 10.11925/infotech.2096-3467.2022.1347
Support for Cross-Domain Methods of Identifying Fake Comments of Chinese
Gu Yan 1, Zheng Kaihong 1, Hu Yongjun 1, Song Yishan 2, Liu Dongping 3
1 School of Management, Guangzhou University, Guangzhou 510006, China
2 School of Data Science, The Chinese University of Hong Kong, Shenzhen 518000, China
3 Partner & Business Enabling, Amazon Web Services GCR, Beijing 100015, China
Abstract  

[Objective] This paper constructs a cross-domain Chinese fake review identification model (CFEE) for multi-domain datasets. The model extracts the semantic information of comment texts and addresses the problems of traditional recognition models. [Methods] First, we established 11 rules for constructing fake review datasets and created a multi-domain dataset. Second, we designed the CFEE model to identify Chinese fake comments across domains. Third, the model extracted deep semantic information with the pre-trained ERNIE model and identified hidden comments based on the emotional attributes of the texts. Finally, it projected the text information onto the word-relation dimension with a convolutional neural network and classified the comments using the fused neural network features. [Results] The CFEE model reached an F1 value of 91.52% on the multi-domain Chinese fake comment dataset. On the single-domain datasets for mobile phones, food, clothing, and household appliances, its F1 values were 85.71%, 79.59%, 85.00%, and 85.71%, respectively. It significantly outperformed the existing models. [Limitations] The manual annotation is subjective. [Conclusions] The proposed method can effectively identify Chinese fake reviews across domains.

Keywords: Fake Comments; ERNIE Model; Cross-Domain Identification; Chinese Semantic; Emotional Score
Received: 21 December 2022      Published: 08 January 2024
ZTFLH:  G252  
Fund: National Social Science Fund of China (18BGL236); National Key R&D Program of China (2021YFB3301801); 2nd Phase of the Ministry of Education Supply and Demand Docking Employment Education Project (20230103480)
Corresponding Author: Hu Yongjun, ORCID: 0000-0002-9395-7535, E-mail: hyjsdu96@126.com

Cite this article:

Gu Yan, Zheng Kaihong, Hu Yongjun, Song Yishan, Liu Dongping. Support for Cross-Domain Methods of Identifying Fake Comments of Chinese. Data Analysis and Knowledge Discovery, 2024, 8(2): 84-98.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1347     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I2/84

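| Identification content | Identification features |
| --- | --- |
| Linguistic features | Word n-gram models, part of speech, word frequency statistics, word sentiment, unnatural vocabulary, etc. |
| Text features | Text length, text complexity, text similarity, degree adverbs, sentiment deviation, etc. |

Identification Content of Fake Comments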
Extraction Process of Chinese Text Semantic Features
Different Masking Strategies of the BERT Model and the ERNIE Model
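
The difference between the two masking strategies can be illustrated with a short sketch. As described in the ERNIE paper, BERT masks individual tokens (single characters in Chinese), while ERNIE masks whole phrases or entities so the model must recover them from context. The helper names, the example sentence, and the span boundaries below are hypothetical and are not taken from the article.

```python
MASK = "[MASK]"

def mask_characters(chars, positions):
    """BERT-style masking: hide individual characters at the given positions."""
    return [MASK if i in positions else ch for i, ch in enumerate(chars)]

def mask_spans(chars, spans, chosen):
    """ERNIE-style masking: hide every character of the chosen phrase/entity spans."""
    masked = list(chars)
    for idx in chosen:
        start, end = spans[idx]          # half-open interval [start, end)
        for i in range(start, end):
            masked[i] = MASK
    return masked

# "Harbin is the capital of Heilongjiang", split into characters.
chars = list("哈尔滨是黑龙江的省会")
# Hypothetical entity/phrase boundaries: 哈尔滨, 黑龙江, 省会.
spans = [(0, 3), (4, 7), (8, 10)]

bert_style = mask_characters(chars, {1})     # only 尔 is hidden
ernie_style = mask_spans(chars, spans, [0])  # the whole entity 哈尔滨 is hidden
```

With character-level masking the model can often guess the hidden character from its neighbours inside the same word. Span-level masking removes that shortcut.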
Flow Chart of TextCNN
Rules of Emotion Scoring
Classification Neural Networks
Framework of CFEE Cross-Domain Identification Method
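
| Dataset | Real comments | Fake comments | Total | Fake/Total |
| --- | --- | --- | --- | --- |
| Training set T | 2,207 | 3,349 | 5,556 | 60.28% |
| Test set M | 775 | 1,225 | 2,000 | 61.25% |
| Total | 2,982 | 4,574 | 7,556 | 60.53% |
| Validation set V | 791 | 1,209 | 2,000 | 60.45% |

Proportion of Real Comments, Fake Comments, Training Set, Test Set, Validation Set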
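
The fake-comment share in each split can be recomputed from the raw counts. This is a small sketch for checking that the splits are similarly balanced. The function and variable names are illustrative only.

```python
def fake_share(real: int, fake: int) -> float:
    """Percentage of fake comments in a split."""
    return 100 * fake / (real + fake)

# (real, fake) counts copied from the dataset table.
splits = {
    "training": (2207, 3349),
    "test": (775, 1225),
    "validation": (791, 1209),
}

shares = {name: round(fake_share(real, fake), 2) for name, (real, fake) in splits.items()}
```

All three splits hold roughly 60% fake comments, so a classifier that always predicts "fake" would already reach about 60% accuracy. That is one reason to report precision, recall, and F1 as well.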
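
| Rank | Word | Frequency |
| --- | --- | --- |
| 1 | 手机 (mobile phone) | 3,846 |
| 2 | 很好 (very good) | 2,033 |
| 3 | 京东 (JD.com) | 1,959 |
| 4 | 速度 (speed) | 1,873 |
| 5 | 不错 (not bad) | 1,768 |
| 6 | 屏幕 (screen) | 1,209 |
| 7 | 系统 (system) | 1,119 |
| 8 | 外观 (appearance) | 1,115 |
| 9 | 感觉 (feeling) | 1,085 |
| 10 | 苹果 (Apple) | 982 |

Top 10 Word Frequency Statistics of Fake Comments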
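
| Rank | Word | Frequency |
| --- | --- | --- |
| 1 | 手机 (mobile phone) | 1,418 |
| 2 | 很好 (very good) | 1,086 |
| 3 | 京东 (JD.com) | 943 |
| 4 | 不错 (not bad) | 913 |
| 5 | 苹果 (Apple) | 669 |
| 6 | 速度 (speed) | 666 |
| 7 | 系统 (system) | 554 |
| 8 | 耳机 (earphones) | 495 |
| 9 | 感觉 (feeling) | 478 |
| 10 | 电脑 (computer) | 442 |

Top 10 Word Frequency Statistics of True Comments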
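
Word frequency rankings of this kind can be produced with a simple counter over segmented comments. The sketch below assumes the comments have already been split into words; the page does not say which Chinese word segmenter or stop-word list was used, so that step is left out. The function name and sample data are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def top_k_words(tokenized_comments: Iterable[List[str]], k: int = 10) -> List[Tuple[str, int]]:
    """Return the k most frequent words across all comments, most frequent first."""
    counts: Counter = Counter()
    for tokens in tokenized_comments:
        counts.update(tokens)
    return counts.most_common(k)

# Toy input: each inner list is one already-segmented comment.
sample = [["手机", "很好"], ["手机", "不错"], ["手机", "很好", "京东"]]
top2 = top_k_words(sample, k=2)
```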
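
| Similarity | All real comments | All fake comments | Any single real comment | Any single fake comment |
| --- | --- | --- | --- | --- |
| All real comments | 1.00 | 0.99 | 0.48 | 0.78 |
| All fake comments | | 1.00 | 0.48 | 0.78 |
| Any single real comment | | | 1.00 | 0.35 |
| Any single fake comment | | | | 1.00 |

Similarity of Text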
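
The page does not state which similarity measure produced these values. As one common choice, the sketch below computes cosine similarity between bag-of-words frequency vectors. It is an assumption for illustration, not the authors' confirmed method, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import List

def cosine_similarity(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine similarity between the word-count vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty input
    return dot / (norm_a * norm_b)

same = cosine_similarity(["手机", "很好"], ["手机", "很好"])
half = cosine_similarity(["手机", "很好"], ["手机", "不错"])
none = cosine_similarity(["手机"], ["口感"])
```

Under a measure like this, two large pooled corpora share most of their vocabulary and score close to 1, while two individual comments overlap far less. That matches the pattern in the table, where the pooled sets score 0.99 but a single real and a single fake comment score 0.35.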
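
| Model | Parameter | Value |
| --- | --- | --- |
| ERNIE | num_epochs | 10 |
| | batch_size | 32 |
| | pad_size | 32 |
| | hidden_size | 768 |
| | require_improvement | 1000 |
| TextCNN | filter_sizes | 2 |
| | num_filters | 256 |
| | dropout | 0.5 |
| FC | fc | (256+1)×2 |
| | Loss function | nn.CrossEntropyLoss() |
| | optimizer | Adam |

Parameters of Model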
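
The parameter table implies a chain of tensor shapes, which the following sketch traces in plain Python. It rests on several assumptions that the page does not confirm: the convolution spans the full hidden width with stride 1 and no padding (the usual TextCNN setup), max-pooling over positions follows it, and the "+1" in the fc row is a single emotion score concatenated to the 256 pooled features. All class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFEEConfig:
    pad_size: int = 32             # tokens per comment after padding/truncation
    hidden_size: int = 768         # ERNIE hidden width
    filter_size: int = 2           # TextCNN window height in tokens
    num_filters: int = 256         # TextCNN output channels
    num_emotion_features: int = 1  # assumed meaning of the "+1" in the fc row
    num_classes: int = 2           # real vs. fake

def trace_shapes(cfg: CFEEConfig) -> dict:
    """Per-sample tensor shapes at each stage (batch dimension omitted)."""
    conv_positions = cfg.pad_size - cfg.filter_size + 1  # valid convolution, stride 1
    fused = cfg.num_filters + cfg.num_emotion_features
    return {
        "ernie_output": (cfg.pad_size, cfg.hidden_size),
        "conv_output": (cfg.num_filters, conv_positions),
        "pooled": (cfg.num_filters,),        # max over positions
        "fused": (fused,),                   # pooled features + emotion score
        "fc_weight": (fused, cfg.num_classes),
    }

shapes = trace_shapes(CFEEConfig())
```

Under these assumptions the fully connected layer maps a 257-dimensional fused vector to two logits, which nn.CrossEntropyLoss() can consume directly.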

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| CFEE | 91.98% | 91.07% | 91.52% |

Results of CFEE Model
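
The F1 value is the harmonic mean of precision and recall. The sketch below recomputes it for the CFEE row, assuming the first two percentages in that row are precision and recall. The reported F1 is consistent with that reading. The function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 by convention when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

cfee_f1 = f1_score(0.9198, 0.9107)   # values from the CFEE row
cfee_f1_percent = round(cfee_f1 * 100, 2)
```

When precision and recall are both zero, F1 is mathematically undefined. This sketch returns 0.0 in that case, which is a common convention rather than something stated on the page.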
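
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Group 1 | 90.35% | 86.77% | 88.52% |
| Group 2 | 92.17% | 87.68% | 89.87% |
| Group 3 | 90.80% | 90.65% | 90.73% |
| CFEE | 91.98% | 91.07% | 91.52% |

Results of Ablation Experiments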

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| TF-IDF-KNN | 73.50% | 81.89% | 77.49% |
| TF-IDF-Decision Tree | 83.96% | 51.53% | 63.86% |
| TF-IDF-Naive Bayes | 96.99% | 37.30% | 53.88% |
| TF-IDF-Logistic | 88.43% | 87.26% | 87.84% |
| BERT-LSTM | 90.31% | 84.78% | 87.46% |
| BERT-BiLSTM | 91.26% | 83.79% | 87.37% |
| CFEE | 91.98% | 91.07% | 91.52% |

Experimental Results of Different Methods

| Dataset | Model | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Harassing text messages | TF-IDF-KNN | 96.70% | 60.98% | 74.79% |
| | TF-IDF-Decision Tree | 74.22% | 98.57% | 84.68% |
| | TF-IDF-Naive Bayes | 93.72% | 92.68% | 93.20% |
| | TF-IDF-Logistic | 96.20% | 96.66% | 94.43% |
| | BERT-LSTM | 97.14% | 97.29% | 97.22% |
| | BERT-BiLSTM | 97.46% | 97.93% | 97.70% |
| | CFEE | 97.93% | 97.93% | 97.93% |
| Microblog rumors | TF-IDF-KNN | 75.23% | 54.58% | 63.26% |
| | TF-IDF-Decision Tree | 77.50% | 20.26% | 32.12% |
| | TF-IDF-Naive Bayes | 90.91% | 65.36% | 76.05% |
| | TF-IDF-Logistic | 84.23% | 82.03% | 83.12% |
| | BERT-LSTM | 85.76% | 92.18% | 88.85% |
| | BERT-BiLSTM | 76.90% | 79.15% | 78.01% |
| | CFEE | 85.76% | 92.18% | 88.85% |

Experimental Results of Different Methods of Harassing Text Messages and Microblog Rumor
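
| Domain | Content | Top-10 word frequency ranking (high to low) |
| --- | --- | --- |
| Mobile phones | Real comments | 手机 (mobile phone) > 不错 (not bad) > 速度 (speed) > 很好 (very good) > 屏幕 (screen) > 玩游戏 (playing games) > 感觉 (feeling) > 清新 (refreshing) > 电池 (battery) > 效果 (effect) |
| | Fake comments | 手机 (mobile phone) > 屏幕 (screen) > 很好 (very good) > 效果 (effect) > 流畅 (smooth) > 外观 (appearance) > 速度 (speed) > 电池 (battery) > 手感 (hand feel) > 不错 (not bad) |
| Food | Real comments | 不错 (not bad) > 味道 (taste) > 口感 (mouthfeel) > 有点 (a bit) > 京东 (JD.com) > 很好 (very good) > 感觉 (feeling) > 物流 (logistics) > 方便 (convenient) > 坚果 (nuts) |
| | Fake comments | 口感 (mouthfeel) > 坚果 (nuts) > 营养 (nutrition) > 方便 (convenient) > 京东 (JD.com) > 味道 (taste) > 很好 (very good) > 新鲜 (fresh) > 不错 (not bad) > 健康 (healthy) |
| Household appliances | Real comments | 电脑 (computer) > 客服 (customer service) > 京东 (JD.com) > 东西 (item) > 电视 (TV) > 洗衣机 (washing machine) > 问题 (problem) > 方便 (convenient) > 很好 (very good) > 效果 (effect) |
| | Fake comments | 好好 (good) > 很好 (very good) > 外观 (appearance) > 电视 (TV) > 质量 (quality) > 效果 (effect) > 京东 (JD.com) > 方便 (convenient) > 物流 (logistics) > 不错 (not bad) |
| Clothing | Real comments | 鞋子 (shoes) > 很好 (very good) > 很舒服 (very comfortable) > 质量 (quality) > 款式 (style) > 颜色 (color) > 舒服 (comfortable) > 尺码 (size) > 面料 (fabric) > 透气 (breathable) |
| | Fake comments | 很好 (very good) > 质量 (quality) > 款式 (style) > 时尚 (fashionable) > 购物 (shopping) > 很舒服 (very comfortable) > 客服 (customer service) > 精细 (refined) > 京东 (JD.com) > 舒服 (comfortable) |

Ranking of Top 10 Word Frequencies in Different Fields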

| Domain | Model | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Mobile phones | TF-IDF-KNN | 52.52% | 68.22% | 59.35% |
| | TF-IDF-Decision Tree | 74.03% | 53.27% | 61.95% |
| | TF-IDF-Naive Bayes | 57.14% | 12.15% | 20.04% |
| | TF-IDF-Logistic | 70.21% | 61.68% | 65.67% |
| | BERT | 70.59% | 67.29% | 68.90% |
| | BERT-LSTM | 75.00% | 64.49% | 69.35% |
| | BERT-BiLSTM | 76.14% | 62.62% | 68.72% |
| | ERNIE | 81.19% | 76.64% | 78.85% |
| | CFEE | 84.55% | 86.92% | 85.71% |
| Food | TF-IDF-KNN | 60.36% | 66.04% | 57.14% |
| | TF-IDF-Decision Tree | 50.00% | 4.72% | 8.62% |
| | TF-IDF-Naive Bayes | 24.14% | 6.60% | 10.37% |
| | TF-IDF-Logistic | 68.97% | 37.74% | 48.79% |
| | BERT | 66.67% | 60.38% | 63.37% |
| | BERT-LSTM | 74.68% | 55.66% | 63.78% |
| | BERT-BiLSTM | 72.50% | 54.72% | 62.37% |
| | ERNIE | 83.87% | 73.58% | 78.39% |
| | CFEE | 86.67% | 73.58% | 79.59% |
| Household appliances | TF-IDF-KNN | 48.51% | 67.71% | 56.52% |
| | TF-IDF-Decision Tree | 65.08% | 42.71% | 51.57% |
| | TF-IDF-Naive Bayes | 41.18% | 7.29% | 12.39% |
| | TF-IDF-Logistic | 64.21% | 63.51% | 63.86% |
| | BERT | 64.20% | 54.17% | 58.76% |
| | BERT-LSTM | 68.18% | 62.50% | 65.22% |
| | BERT-BiLSTM | 73.42% | 60.42% | 66.89% |
| | ERNIE | 71.91% | 66.67% | 69.19% |
| | CFEE | 87.70% | 84.38% | 85.71% |
| Clothing | TF-IDF-KNN | 23.97% | 43.94% | 31.02% |
| | TF-IDF-Decision Tree | 33.33% | 9.09% | 14.28% |
| | TF-IDF-Naive Bayes | 0% | 0% | - |
| | TF-IDF-Logistic | 44.30% | 53.03% | 48.27% |
| | BERT | 38.14% | 56.06% | 45.40% |
| | BERT-LSTM | 44.57% | 62.12% | 51.06% |
| | BERT-BiLSTM | 52.63% | 45.45% | 48.78% |
| | ERNIE | 69.09% | 57.58% | 62.81% |
| | CFEE | 94.44% | 77.27% | 85.00% |

Experimental Results of Data Sets in Different Fields