Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (5): 102-112    DOI: 10.11925/infotech.2096-3467.2023.0519
Sentiment Analysis of User Reviews Integrating Margin Sampling and Tri-training
Jiang Yiping1, Zhang Ting1, Xia Zhengming1, Li Yuhua2, Zhang Zhaotong1
1College of Information Management, Nanjing Agricultural University, Nanjing 210031, China
2College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210031, China
[Objective] This paper proposes a sentiment analysis method for user reviews that integrates margin sampling and Tri-training. It targets three difficulties of user reviews: their large volume, ambiguous sentiment tendencies, and short content. [Methods] First, we constructed a multi-class support vector machine based on a one-vs-all decomposition strategy. Then, we integrated a margin sampling strategy that considers cosine similarity to create an initial set. Finally, we proposed a Tri-training algorithm combined with a soft voting mechanism. [Results] The proposed algorithm improved the voting mechanism of Tri-training, which further reduced the probability that multiple classifiers misjudge a sample. All categories achieved precision above 79%. [Limitations] The proposed method does not consider extracting information from multimedia data. [Conclusions] Compared with traditional and recently improved semi-supervised learning algorithms, the proposed algorithm demonstrates superior classification accuracy and efficiency.
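The abstract does not spell out how the soft voting mechanism is computed, so the following is only a minimal sketch of one common reading: the peer classifiers' class probabilities are averaged, and an unlabeled sample receives a pseudo-label only when the averaged confidence is high enough. The function name `soft_vote` and the `threshold` value are assumptions made for illustration, not details taken from the paper.

```python
from typing import Optional, Sequence


def soft_vote(prob_rows: Sequence[Sequence[float]],
              threshold: float = 0.6) -> Optional[int]:
    """Average per-class probabilities from several classifiers.

    prob_rows holds one probability vector per classifier.
    Returns the index of the winning class, or None when the averaged
    confidence is below `threshold` (hypothetical cut-off), in which
    case the sample would stay unlabeled.
    """
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / len(prob_rows)
           for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best if avg[best] >= threshold else None


# Two peer classifiers agree strongly on class 0: accept the pseudo-label.
confident = soft_vote([[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]])
# The peers disagree, so the averaged confidence is too low: reject.
uncertain = soft_vote([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]])
```

Compared with hard majority voting, averaging probabilities lets a confident classifier outweigh a hesitant one, which is one way a voting change could reduce misjudged samples.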

Key words: User Reviews; Sentiment Analysis; Margin Sampling; Tri-Training
Received: 31 May 2023      Published: 08 January 2024
ZTFLH:  TP391  
Fund: Social Science Foundation of Jiangsu Province (21GLC003); Humanity and Social Science Project of Ministry of Education of China (22YJA630033); Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX23_0229)
Corresponding Author: Zhang Zhaotong, ORCID: 0000-0002-1155-8603

Jiang Yiping, Zhang Ting, Xia Zhengming, Li Yuhua, Zhang Zhaotong. Sentiment Analysis of User Reviews Integrating Margin Sampling and Tri-training. Data Analysis and Knowledge Discovery, 2024, 8(5): 102-112.

E-Commerce Review Sentiment Analysis Framework Based on Margin Sampling and Tri-training
Improved Margin Sampling
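As a rough illustration of margin sampling that also considers cosine similarity, the sketch below ranks unlabeled samples by the gap between their two highest classifier scores (a small gap means the sample lies near a decision boundary) and skips candidates that are nearly identical to ones already chosen. The names `select_samples` and `max_sim`, and the rule of filtering by a similarity ceiling, are assumptions; the paper's exact criterion is not given on this page.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def margin(scores: Sequence[float]) -> float:
    """Gap between the two highest class scores of one sample."""
    top = sorted(scores, reverse=True)
    return top[0] - top[1]


def select_samples(features: Sequence[Sequence[float]],
                   scores: Sequence[Sequence[float]],
                   k: int,
                   max_sim: float = 0.9) -> List[int]:
    """Pick up to k low-margin samples that are not near-duplicates.

    max_sim is a hypothetical similarity ceiling between chosen samples.
    """
    order = sorted(range(len(features)), key=lambda i: margin(scores[i]))
    chosen: List[int] = []
    for i in order:
        if len(chosen) >= k:
            break
        if all(cosine(features[i], features[j]) <= max_sim for j in chosen):
            chosen.append(i)
    return chosen


features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [1.0, 1.0]]
scores = [[0.1, 0.0, -1.0], [0.2, 0.0, -1.0], [0.5, 0.0, -1.0], [2.0, -1.0, -1.0]]
picked = select_samples(features, scores, k=2)
```

In this toy input, sample 1 has the second-smallest margin but is almost parallel to sample 0, so the selection moves on to sample 2.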
Data Preprocessing
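| Review content | English translation | Group | Word segmentation result |
| 看起来很新鲜,京东自营的生鲜质量很令人放心 | Looks very fresh; the quality of JD self-operated fresh produce is very reassuring | 5 | 看起来/很/新鲜,京东/自营/的/生鲜/质量/很/令人放心 |
| 大品牌鲜果,很新鲜,选这个牌子也是精挑细选了好久好久,送货速度快 | Big-brand fresh fruit, very fresh; I picked this brand after a long and careful selection; fast delivery | 5 | 大品牌/鲜果,很/新鲜,选/这个/牌子/也是/精挑细选/了/好久好久,送货/速度/快 |
| 纯进口鲜果口感特别香甜软糯,值得大家购买 | Purely imported fresh fruit, especially sweet, soft and glutinous; worth buying | 5 | 纯/进口/鲜果/口感/特别/香甜/软糯,值得/大家/购买 |
| …… | …… | …… | …… |
| 这次买的不好,坏的很快 | This purchase was not good; it spoiled quickly | 1 | 这次/买/的/不好,坏/的/很快 |
| 70块钱买了一堆臭东西,最信任的平台 | Spent 70 yuan on a pile of rotten stuff, from the platform I trust most | 1 | 70/块钱/买/了/一堆/臭/东西,最/信任/的/平台 |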
Word Segmentation and Grouping of E-commerce Reviews
Number of Reviews on Main Product Attributes
The One vs All Strategy
Demonstration of the One vs All Decomposition
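The one-vs-all decomposition can be sketched as follows: for each of the five sentiment categories, the labels are rewritten as "this category" (+1) versus "all others" (-1), one binary classifier is trained per category, and the category whose classifier returns the largest decision value wins. The helper names and the decision values below are hypothetical and for illustration only; training the binary SVMs themselves is omitted.

```python
from typing import Dict, List, Sequence


def one_vs_all_labels(labels: Sequence[int], positive: int) -> List[int]:
    """Relabel a multi-class target as +1 (positive class) or -1 (rest)."""
    return [1 if y == positive else -1 for y in labels]


def predict_one_vs_all(decision_values: Dict[int, float]) -> int:
    """Return the class whose binary classifier is most confident.

    decision_values maps each class to the signed score produced by
    that class's binary classifier for a single sample.
    """
    return max(decision_values, key=lambda c: decision_values[c])


ratings = [5, 4, 3, 5, 1]
binary_for_5 = one_vs_all_labels(ratings, positive=5)

# Hypothetical decision values from five binary classifiers for one review.
predicted = predict_one_vs_all({5: -0.3, 4: 0.8, 3: 0.1, 2: -1.2, 1: -0.9})
```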
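Columns give the sentiment category (5 to 1) under the Labeled5 and Labeled10 settings.

| Method | Metric | Labeled5: 5 | Labeled5: 4 | Labeled5: 3 | Labeled5: 2 | Labeled5: 1 | Labeled10: 5 | Labeled10: 4 | Labeled10: 3 | Labeled10: 2 | Labeled10: 1 |
| | Total reviews | 704 | 711 | 703 | 695 | 687 | 709 | 691 | 714 | 700 | 686 |
| Self-training | Precision (%) | 63.3 | 65.5 | 71.4 | 73.3 | 70.0 | 66.1 | 70.7 | 66.1 | 72.9 | 74.8 |
| Self-training | Recall (%) | 65.3 | 67.2 | 70.3 | 69.9 | 71.4 | 63.9 | 71.8 | 75.6 | 67.7 | 72.3 |
| Self-training | F1 (%) | 64.3 | 66.3 | 70.8 | 71.6 | 70.7 | 65.0 | 71.2 | 70.5 | 70.2 | 73.5 |
| Tri-training | Precision (%) | 70.2 | 73.5 | 75.3 | 75.9 | 74.0 | 72.5 | 74.3 | 76.2 | 73.3 | 75.8 |
| Tri-training | Recall (%) | 72.7 | 77.0 | 80.9 | 81.2 | 75.8 | 76.5 | 69.9 | 79.9 | 74.9 | 73.1 |
| Tri-training | F1 (%) | 73.4 | 77.7 | 80.6 | 80.5 | 76.9 | 74.4 | 72.0 | 78.0 | 72.5 | 74.4 |
| DW-TCI | Precision (%) | 75.7 | 76.7 | 78.1 | 77.4 | 76.0 | 77.1 | 77.6 | 78.9 | 77.9 | 76.5 |
| DW-TCI | Recall (%) | 74.7 | 77.7 | 81.4 | 81.8 | 76.9 | 77.8 | 73.4 | 79.2 | 74.3 | 74.5 |
| DW-TCI | F1 (%) | 75.2 | 78.1 | 80.8 | 80.9 | 77.1 | 77.9 | 74.5 | 78.6 | 75.3 | 76.4 |
| Improved SVM | Precision (%) | 76.1 | 77.3 | 78.5 | 77.9 | 76.5 | 77.4 | 78.4 | 79.2 | 78.1 | 78.8 |
| Improved SVM | Recall (%) | 75.1 | 77.9 | 81.9 | 82.2 | 77.1 | 78.5 | 75.1 | 78.9 | 76.2 | 77.4 |
| Improved SVM | F1 (%) | 76.7 | 78.5 | 81.2 | 81.5 | 78.3 | 78.6 | 75.7 | 79.2 | 77.5 | 78.1 |
| IMS-Tri-training | Precision (%) | 79.5 | 82.6 | 81.9 | 82.1 | 81.2 | 84.3 | 86.8 | 82.8 | 81.9 | 87.3 |
| IMS-Tri-training | Recall (%) | 84.1 | 80.2 | 84.4 | 83.0 | 82.8 | 87.3 | 82.2 | 79.0 | 79.1 | 84.9 |
| IMS-Tri-training | F1 (%) | 81.7 | 81.4 | 83.1 | 82.5 | 82.0 | 85.8 | 84.4 | 80.9 | 80.5 | 86.1 |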
Classification Effect of Five Semi-Supervised Learning Methods
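For readers checking the reported metrics, the F1 value is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below applies this standard definition to the IMS-Tri-training figures for category 5 under Labeled5 (precision 79.5%, recall 84.1%); the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# IMS-Tri-training, category 5, Labeled5: precision 79.5 %, recall 84.1 %.
f1 = round(f1_score(79.5, 84.1), 1)
print(f1)  # 81.7
```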
Classification Accuracy of Semi-Supervised Learning Methods with Different Sample Sizes
Classification Analysis Time of Semi-Supervised Learning Methods with Different Sample Sizes
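| Distribution function | Sum of squared errors | AIC | BIC | KL divergence |
| exponpow | 0.073 | 24.710 | -43633.213 | 0.210 |
| t | 0.107 | 24.130 | -42111.546 | 0.289 |
| norm | 0.107 | 22.130 | -42119.752 | 0.289 |
| lognorm | 0.107 | 24.134 | -42108.580 | 0.289 |
| cauchy | 0.109 | 26.339 | -42040.730 | 0.314 |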
Fitting Distribution of IMS-Tri-training Classification Data
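Two of the goodness-of-fit columns can be illustrated with their textbook definitions: the sum of squared errors between an empirical density and a fitted density evaluated on the same bins, and the Kullback-Leibler divergence between two discrete distributions. This is only a sketch under assumptions: the page does not state how the data were binned, in which direction the KL divergence was taken, or which tool produced the AIC and BIC values, so those are not reproduced here. The numbers in the example are made up.

```python
import math
from typing import Sequence


def sum_squared_error(empirical: Sequence[float],
                      fitted: Sequence[float]) -> float:
    """Sum of squared differences between two densities on shared bins."""
    return sum((e - f) ** 2 for e, f in zip(empirical, fitted))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions, using natural logarithms.

    Bins with p == 0 contribute nothing; a bin with p > 0 and q == 0
    makes the divergence infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


# Made-up two-bin example: empirical histogram vs. a candidate fit.
empirical = [0.5, 0.5]
candidate = [0.25, 0.75]
sse = sum_squared_error(empirical, candidate)
kl = kl_divergence(empirical, candidate)
```

Lower values of either measure indicate a closer fit, which is consistent with exponpow having the smallest values in both columns of the table.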
The Fitting of Different Distributions to the Sentiment Classification of Online Reviews
