[Objective] This paper proposes a semi-supervised sentiment analysis model based on multi-level data augmentation, aiming to generate high-quality labeled data for Chinese natural language processing. [Methods] First, we augmented a large amount of unlabeled data using easy data augmentation and back-translation techniques. Then, we extracted training signals from the unlabeled samples by computing a consistency regularization term. Third, we computed pseudo-labels for the weakly augmented samples, and constructed supervised training signals by pairing the strongly augmented samples with these pseudo-labels. Finally, we set a confidence threshold for the model to generate prediction results. [Results] We evaluated the proposed model on three publicly available sentiment analysis datasets. With only 1,000 labeled documents from the Waimai and Weibo datasets, our model outperformed BERT by 2.311% and 6.726%, respectively. [Limitations] We did not evaluate the model's performance on vertical-domain datasets. [Conclusions] The proposed method fully exploits the information in unlabeled samples to address the shortage of labeled data, and shows strong prediction stability.
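The pseudo-labeling step described in [Methods] can be sketched as follows in a FixMatch-style scheme: the weakly augmented view supplies a hard pseudo-label, the strongly augmented view is trained against it with cross-entropy, and a confidence threshold masks out uncertain samples. The helper names, the toy logits, and the threshold value 0.95 below are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pseudo_label_loss(weak_logits, strong_logits, tau=0.95):
    """Unsupervised loss for one batch of unlabeled samples:
    pseudo-labels come from the weakly augmented view; cross-entropy
    is computed on the strongly augmented view, masked by threshold tau."""
    weak_probs = softmax(weak_logits)
    conf = weak_probs.max(axis=-1)        # model confidence per sample
    pseudo = weak_probs.argmax(axis=-1)   # hard pseudo-labels
    mask = conf >= tau                    # keep only confident samples
    strong_probs = softmax(strong_logits)
    ce = -np.log(strong_probs[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (ce * mask).mean(), mask

# Toy example: 3 unlabeled samples, 2 sentiment classes (neg/pos).
weak = np.array([[4.0, 0.0],   # confident -> pseudo-label 0
                 [0.2, 0.1],   # low confidence -> masked out
                 [0.0, 5.0]])  # confident -> pseudo-label 1
strong = np.array([[3.0, 0.5], [0.0, 0.3], [0.5, 4.0]])
loss, mask = pseudo_label_loss(weak, strong)
```

Only the first and third samples clear the threshold, so the middle sample contributes nothing to the loss; this masking is what keeps noisy pseudo-labels from dominating early training.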
Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation. Data Analysis and Knowledge Discovery, 2021, 5(5): 51-58.
Knight K, Graehl J. Machine Transliteration[J]. Computational Linguistics, 1998,24:599-612.
Joulin A, Grave E, Bojanowski P, et al. Bag of Tricks for Efficient Text Classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017: 427-431.
Wilson T, Wiebe J, Hoffmann P. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2005: 347-354.
Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of the 18th International Conference on Machine Learning. 2001: 282-289.
Xie Q Z, Dai Z H, Hovy E, et al. Unsupervised Data Augmentation for Consistency Training[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020.
Chen J A, Yang Z C, Yang D Y. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Wei J, Zou K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks[OL]. arXiv Preprint, arXiv:1901.11196.
Sennrich R, Haddow B, Birch A. Improving Neural Machine Translation Models with Monolingual Data[OL]. arXiv Preprint, arXiv:1511.06709.
Nadler B, Srebro N, Zhou X Y. Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data[C]// Proceedings of the 20th International Conference on Neural Information Processing Systems. 2007: 801-808.
Bachman P, Alsharif O, Precup D. Learning with Pseudo-Ensembles[C]// Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014: 3365-3373.
Grandvalet Y, Bengio Y. Semi-Supervised Learning by Entropy Minimization[C]// Proceedings of the International Conference on Neural Information Processing Systems. 2005: 529-536.
Lee D H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks[C]// Proceedings of ICML 2013 Workshop: Challenges in Representation Learning. 2013: 1-6.
Laine S, Aila T. Temporal Ensembling for Semi-Supervised Learning[C]// Proceedings of the 5th International Conference on Learning Representations. 2017.
Miyato T, Maeda S I, Koyama M, et al. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019,41(8):1979-1993.
Berthelot D, Carlini N, Goodfellow I, et al. MixMatch: A Holistic Approach to Semi-Supervised Learning[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019.
Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.