Please wait a minute...
Advanced Search
数据分析与知识发现  2021, Vol. 5 Issue (5): 51-58     https://doi.org/10.11925/infotech.2096-3467.2020.1170
     研究论文 本期目录 | 过刊浏览 | 高级检索 |
多层次数据增强的半监督中文情感分析方法*
刘彤,刘琛,倪维健()
山东科技大学计算机科学与工程学院 青岛 266590
A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation
Liu Tong,Liu Chen,Ni Weijian()
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
全文: PDF (915 KB)   HTML ( 7
输出: BibTeX | EndNote (RIS)      
摘要 

【目的】 针对在自然语言处理领域中高质量的标签数据较难获取的问题,设计基于多层次数据增强的半监督中文情感分析方法。【方法】 采用简单数据增强和反向翻译的文本增强技术获取大量无标签数据,通过对无标签数据计算一致性正则提取无标签数据的数据信号;对弱增强数据计算其预判标签,将强增强数据与预判标签一起构建监督训练信号,通过置信度阈值过滤使模型得出置信度高的预测结果。【结果】 在三个公开情感分析数据集上进行实验,在Waimai和Weibo数据集上仅使用1 000条有标签文档就可以分别获得超过BERT 2.311%和6.726%的性能提升。【局限】 实验均在公开通用语料上进行,未验证在垂直领域数据集上的效果。【结论】 所提方法充分挖掘了无标签数据的信息,可以缓解标签数据不易获取的问题,同时具有较强的预测稳定性。

服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
刘彤
刘琛
倪维健
关键词 情感分析半监督学习一致性正则数据增强    
Abstract

[Objective] This paper designs a semi-supervised model for sentiment analysis based on multi-level data augmentation, aiming to generate high-quality labeled data for natural language processing in Chinese. [Methods] First, we collected large amount of unlabeled data with the help of simple data enhancement and reverse translation of text enhancement techniques. Then, we extracted the data signals of unlabeled samples by calculating their consistency norms. Third, we calculated the pseudo-label of the weakly enhanced samples, and constructed the supervised training signal from the strongly enhanced sample together with the pseudo-label. Finally, we set confidence threshold for the model to generate prediction results. [Results] We examined the proposed model with three publicly available datasets for sentiment analysis. With only 1 000 labeled documents from the Waimai and Weibo datasets, the performance of our model was 2.311% and 6.726% better than those of the BERT. [Limitations] We did not evaluate the model’s performance with vertical domain datasets. [Conclusions] The proposed method fully utilizes the information of unlabeled samples to address the issue of insufficient labeled data, and shows strong predicting stability.

Key wordsSentiment Analysis    Semi-Supervised Learning    Consistency Regularity    Data Augmentation
收稿日期: 2020-11-27      出版日期: 2021-05-27
ZTFLH:  TP393  
基金资助:*本文系国家自然科学基金项目(项目编号)(71704096);*本文系国家自然科学基金项目(项目编号)(61602278);青岛社会科学规划项目(项目编号)的研究成果之一(QDSKL2001117)
通讯作者: 倪维健     E-mail: niweijian@gmail.com
引用本文:   
刘彤,刘琛,倪维健. 多层次数据增强的半监督中文情感分析方法*[J]. 数据分析与知识发现, 2021, 5(5): 51-58.
Liu Tong,Liu Chen,Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation. Data Analysis and Knowledge Discovery, 2021, 5(5): 51-58.
链接本文:  
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.1170      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I5/51
Fig. 1  SA-MLA方法结构图
数据集 文档数量 情感类别数量
Waimai 11 988 2
Dmsc 27 389 5
Weibo 8 000 4
Table 1  三种数据集数据信息统计
方法 Waimai Dmsc Weibo
BERT 84.632% 51.662% 42.067%
TextCNN 81.715% 46.944% 41.483%
UDA 85.432% 44.986% 47.547%
SA-MLA 86.943% 45.319% 48.793%
Table 2  各方法预测效果对比
方法 有标签数据 Waimai Dmsc Weibo
UDA 500 86.231% 41.837% 42.174%
1 000 85.432% 44.986% 47.547%
2 000 92.082% 46.039% 53.812%
SA-MLA 500 86.456% 42.413% 42.838%
1 000 86.943% 45.319% 48.793%
2 000 92.621% 46.825% 54.427%
Table 3  半监督方法使用不同数量有标签文档的性能
Fig.2  UDA和SA-MLA方法在各数据集上使用不同数量无标签文档的预测性能
[1] Knight K, Graehl J. Machine Transliteration[J]. Computational Linguistics, 1998,24:599-612.
[2] Armand J, Edouard G, Piotr B, et al. Bag of Tricks for Efficient Text Classification[C]// Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017: 427-431.
[3] Theresa W, Janyce W, Paul H. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2005: 347-354.
[4] John D L, McCallum A, Fernando C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of the 18th International Conference on Machine Learning. 2003: 282-289.
[5] Xie Q Z, Dai Z H, Hovy E, et al. Unsupervised Data Augmentation for Consistency Training[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020.
[6] Chen J A, Yang Z C, Yang D Y. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
[7] Wei J, Zou K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks[OL]. arXiv Preprint, arXiv: 1901. 11196.
[8] Sennrich R, Haddow B, Birch A. Improving Neural Machine Translation Models with Monolingual Data[OL]. arXiv Preprint, arXiv: 1511. 06709.
[9] Nadler B, Srebro N, Zhou X Y. Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data[C]// Proceedings of the 20th International Conference on Neural Information Processing Systems December. 2007: 801-808.
[10] Bachman P, Alsharif O, Precup D. Learning with Pseudo-Ensembles[C]// Proceedings of the 27th International Conference on Neural Information Processing. 2014: 3365-3373.
[11] Grandvalet Y, Bengio Y. Semi-Supervised Learning by Entropy Minimization[C]// Proceedings of the 26th International Conference on Neural Information Processing. 2005: 529-536.
[12] Lee D H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks[C]// Proceedings of ICML 2013 Workshop: Challenges in Representation Learning. 2013: 1-6.
[13] Samuli L, Timo A. Temporal Ensembling for Semi-Supervised Learning[C]// Proceeding of the 5th International Conference on Learning Representations. 2017.
[14] Miyato T, Maeda S I, Koyama M, et al. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019,41(8):1979-1993.
doi: 10.1109/TPAMI.34
[15] Berthelot D, Carlini N, Goodfellow I, et al. MixMatch: A Holistic Approach to Semi-Supervised Learning[C]// Proceedings of the 33rd International Conference on Neural Information Processing System, 2019.
[16] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[17] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[1] 钟佳娃,刘巍,王思丽,杨恒. 文本情感分析方法及应用综述*[J]. 数据分析与知识发现, 2021, 5(6): 1-13.
[2] 王雨竹,谢珺,陈波,续欣莹. 基于跨模态上下文感知注意力的多模态情感分析 *[J]. 数据分析与知识发现, 2021, 5(4): 49-59.
[3] 常城扬,王晓东,张胜磊. 基于深度学习方法对特定群体推特的动态政治情感极性分析*[J]. 数据分析与知识发现, 2021, 5(3): 121-131.
[4] 张梦瑶, 朱广丽, 张顺香, 张标. 基于情感分析的微博热点话题用户群体划分模型 *[J]. 数据分析与知识发现, 2021, 5(2): 43-49.
[5] 韩普, 张伟, 张展鹏, 王宇欣, 方浩宇. 基于特征融合和多通道的突发公共卫生事件微博情感分析*[J]. 数据分析与知识发现, 2021, 5(11): 68-79.
[6] 吕华揆,刘政昊,钱宇星,洪旭东. 异质性财经新闻与股市关系研究*[J]. 数据分析与知识发现, 2021, 5(1): 99-111.
[7] 徐红霞,于倩倩,钱力. 基于主题模型和情感分析的话题交互数据观点对抗性分析 *[J]. 数据分析与知识发现, 2020, 4(7): 110-117.
[8] 姜霖,张麒麟. 基于引文细粒度情感量化的学术评价研究*[J]. 数据分析与知识发现, 2020, 4(6): 129-138.
[9] 石磊,王毅,成颖,魏瑞斌. 自然语言处理中的注意力机制研究综述*[J]. 数据分析与知识发现, 2020, 4(5): 1-14.
[10] 李铁军,颜端武,杨雄飞. 基于情感加权关联规则的微博推荐研究*[J]. 数据分析与知识发现, 2020, 4(4): 27-33.
[11] 沈卓,李艳. 基于PreLM-FT细粒度情感分析的餐饮业用户评论挖掘[J]. 数据分析与知识发现, 2020, 4(4): 63-71.
[12] 薛福亮,刘丽芳. 一种基于CRF与ATAE-LSTM的细粒度情感分析方法*[J]. 数据分析与知识发现, 2020, 4(2/3): 207-213.
[13] 张翼鹏,马敬东. 突发公共卫生事件误导信息受众情感分析及传播特征研究*[J]. 数据分析与知识发现, 2020, 4(12): 45-54.
[14] 谭荧,张进,夏立新. 社交媒体情境下的情感分析研究综述[J]. 数据分析与知识发现, 2020, 4(1): 1-11.
[15] 聂卉,何欢. 引入词向量的隐性特征识别研究*[J]. 数据分析与知识发现, 2020, 4(1): 99-110.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn