Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (5): 51-58    DOI: 10.11925/infotech.2096-3467.2020.1170
A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation
Liu Tong, Liu Chen, Ni Weijian
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China

[Objective] This paper designs a semi-supervised sentiment analysis model based on multi-level data augmentation, aiming to generate high-quality labeled data for Chinese natural language processing. [Methods] First, we obtained a large amount of unlabeled data using two text augmentation techniques: easy data augmentation and back-translation. Then, we extracted training signals from the unlabeled samples by computing a consistency regularization term. Third, we computed pseudo-labels for the weakly augmented samples and built a supervised training signal by pairing each strongly augmented sample with its pseudo-label. Finally, we set a confidence threshold for the model when generating prediction results. [Results] We evaluated the proposed model on three publicly available sentiment analysis datasets. With only 1 000 labeled documents, the model outperformed BERT by 2.311 and 6.726 percentage points on the Waimai and Weibo datasets, respectively. [Limitations] We did not evaluate the model on vertical-domain datasets. [Conclusions] The proposed method makes full use of the information in unlabeled samples to address the shortage of labeled data, and its predictions are stable.
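
The pseudo-labeling and confidence-threshold steps in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the 0.95 default threshold, and averaging over the whole batch are assumptions made for illustration, and the inputs are plain lists of logits standing in for a classifier's outputs on weakly and strongly augmented copies of the same unlabeled texts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy between pseudo-labels and strong-view predictions.

    The pseudo-label of each sample is the most probable class on its
    weakly augmented view. Samples whose weak-view confidence is below
    `threshold` contribute nothing. The 0.95 default and the division by
    the full batch size are assumptions, not values from the paper.
    """
    if not weak_logits:
        return 0.0
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        p_weak = softmax(weak)
        confidence = max(p_weak)
        if confidence < threshold:
            continue  # prediction too uncertain to serve as a label
        label = p_weak.index(confidence)
        p_strong = softmax(strong)
        total += -math.log(p_strong[label])
    return total / len(weak_logits)


# Two unlabeled samples, two classes: only the first passes the threshold.
weak = [[4.0, 0.0], [0.2, 0.1]]
strong = [[2.0, 0.0], [0.0, 3.0]]
print(pseudo_label_loss(weak, strong, threshold=0.9))
```

In this toy batch the second sample is skipped because its weak-view prediction is close to uniform, so only the first sample produces a training signal.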

Keywords: Sentiment Analysis; Semi-Supervised Learning; Consistency Regularity; Data Augmentation
Received: 27 November 2020      Published: 27 May 2021
ZTFLH:  TP393  
Fund: This work is supported by the National Natural Science Foundation of China (71704096, 61602278) and the Qingdao Social Science Planning Project (QDSKL2001117).
Corresponding author: Ni Weijian

Cite this article:

Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation. Data Analysis and Knowledge Discovery, 2021, 5(5): 51-58.


[Figure: The Structure of SA-MLA]
Dataset    Documents    Sentiment classes
Waimai     11 988       2
Dmsc       27 389       5
Weibo      8 000        4
Table: The Statistics of the Three Datasets
Method     Waimai     Dmsc       Weibo
BERT       84.632%    51.662%    42.067%
TextCNN    81.715%    46.944%    41.483%
UDA        85.432%    44.986%    47.547%
SA-MLA     86.943%    45.319%    48.793%
Table: The Prediction Performance of Each Method
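
As a quick arithmetic check, the gains quoted in the abstract can be recomputed from the table above. The sketch below copies the table values verbatim; the dictionary layout and the helper name `gain_over` are illustrative choices, and the differences are in percentage points.

```python
# Scores (%) copied from the method comparison table.
scores = {
    "BERT": {"Waimai": 84.632, "Dmsc": 51.662, "Weibo": 42.067},
    "TextCNN": {"Waimai": 81.715, "Dmsc": 46.944, "Weibo": 41.483},
    "UDA": {"Waimai": 85.432, "Dmsc": 44.986, "Weibo": 47.547},
    "SA-MLA": {"Waimai": 86.943, "Dmsc": 45.319, "Weibo": 48.793},
}


def gain_over(baseline, method="SA-MLA"):
    """Per-dataset difference (percentage points) of `method` over `baseline`."""
    return {
        dataset: round(scores[method][dataset] - scores[baseline][dataset], 3)
        for dataset in scores[method]
    }


print(gain_over("BERT"))
print(gain_over("UDA"))
```

The Waimai and Weibo differences against BERT come out to 2.311 and 6.726, matching the abstract. On Dmsc the difference is negative, that is, BERT scores higher than SA-MLA in this table.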
Method     Labeled documents    Waimai     Dmsc       Weibo
UDA        500                  86.231%    41.837%    42.174%
UDA        1 000                85.432%    44.986%    47.547%
UDA        2 000                92.082%    46.039%    53.812%
SA-MLA     500                  86.456%    42.413%    42.838%
SA-MLA     1 000                86.943%    45.319%    48.793%
SA-MLA     2 000                92.621%    46.825%    54.427%
Table: The Performance of Semi-Supervised Methods Using Different Numbers of Labeled Documents
[Figure: The Performance of UDA and SA-MLA Using Different Numbers of Unlabeled Samples]
Related articles:
[1] Wang Yuzhu, Xie Jun, Chen Bo, Xu Xinying. Multi-modal Sentiment Analysis Based on Cross-modal Context-aware Attention[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 49-59.
[2] Li Feifei, Wu Fan, Wang Zhongqing. Sentiment Analysis with Reviewer Types and Generative Adversarial Network[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 72-79.
[3] Chang Chengyang, Wang Xiaodong, Zhang Shenglei. Polarity Analysis of Dynamic Political Sentiments from Tweets with Deep Learning Method[J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 121-131.
[4] Zhang Mengyao, Zhu Guangli, Zhang Shunxiang, Zhang Biao. Grouping Microblog Users of Trending Topics Based on Sentiment Analysis[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 43-49.
[5] Lv Huakui, Liu Zhenghao, Qian Yuxing, Hong Xudong. Relationship Between Financial News and Stock Market Fluctuations[J]. Data Analysis and Knowledge Discovery, 2021, 5(1): 99-111.
[6] Xu Hongxia, Yu Qianqian, Qian Li. Studying Content Interaction Data with Topic Model and Sentiment Analysis[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 110-117.
[7] Jiang Lin, Zhang Qilin. Research on Academic Evaluation Based on Fine-Grain Citation Sentimental Quantification[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 129-138.
[8] Shi Lei, Wang Yi, Cheng Ying, Wei Ruibin. Review of Attention Mechanism in Natural Language Processing[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 1-14.
[9] Li Tiejun, Yan Duanwu, Yang Xiongfei. Recommending Microblogs Based on Emotion-Weighted Association Rules[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 27-33.
[10] Shen Zhuo, Li Yan. Mining User Reviews with PreLM-FT Fine-Grain Sentiment Analysis[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 63-71.
[11] Xue Fuliang, Liu Lifang. Fine-Grained Sentiment Analysis with CRF and ATAE-LSTM[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 207-213.
[12] Zhang Yipeng, Ma Jingdong. Analyzing Sentiments and Dissemination of Misinformation on Public Health Emergency[J]. Data Analysis and Knowledge Discovery, 2020, 4(12): 45-54.
[13] Ying Tan, Jin Zhang, Lixin Xia. A Survey of Sentiment Analysis on Social Media[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 1-11.
[14] Hui Nie, Huan He. Identifying Implicit Features with Word Embedding[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 99-110.
[15] Yonghua Cen, Zhihao Tan, Chengyao Wu. Impacts of Financial Media Information on Stock Market: An Empirical Study of Sentiment Analysis[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 98-114.