Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (11): 14-25     https://doi.org/10.11925/infotech.2096-3467.2022.1020
Research Paper
A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT
Yang Ruyun, Ma Jing
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract

[Objective] This paper aims to enhance the quality of multi-modal feature extraction and improve the accuracy of user sentiment recognition in multi-modal public opinion. [Methods] First, we extracted features of the text modality using RoBERTa and enhanced them with a knowledge phrase representation dictionary. Then, we proposed a Res-ViT model for the image modality, combining ResNet and Vision Transformer. Finally, we fused the multi-modal features with Transformer encoders and fed the representations to a fully connected layer for sentiment recognition. [Results] We evaluated our model on the MVSA-Multiple dataset and achieved an accuracy of 71.66% and an F1 score of 69.42% for sentiment recognition, gains of 2.22 and 0.59 percentage points over the best scores of the baseline methods. [Limitations] More research is needed to examine the model with other datasets to verify its generalizability and robustness. [Conclusions] The proposed model can more effectively extract and fuse multi-modal features and improves the accuracy of sentiment recognition.
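As a reading aid, the pipeline described in the abstract (RoBERTa text features, Res-ViT image features, Transformer-encoder fusion, fully connected classifier) can be sketched in PyTorch as follows. This is a minimal illustration under our own assumptions: the class name, dimensions, and mean-pooling choice are ours, not the authors' released code, and the knowledge-phrase enhancement step is omitted.

```python
# Minimal sketch of the pipeline in the abstract: RoBERTa text features,
# Res-ViT image features, Transformer-encoder fusion, and a fully
# connected sentiment classifier. Names and dimensions are illustrative
# assumptions; the paper's knowledge-phrase enhancement step is omitted.
import torch
import torch.nn as nn
from transformers import RobertaModel

class MultimodalSentimentModel(nn.Module):
    def __init__(self, image_encoder, hidden_dim=768, num_classes=3,
                 num_fusion_layers=4, dropout=0.1):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = image_encoder  # e.g., a Res-ViT-style module
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, dropout=dropout, batch_first=True)
        # Table 2 lists 4 fusion Transformer layers.
        self.fusion = nn.TransformerEncoder(layer, num_fusion_layers)
        self.classifier = nn.Sequential(
            nn.Dropout(dropout), nn.Linear(hidden_dim, num_classes))

    def forward(self, input_ids, attention_mask, images):
        # Token-level text features: (B, L_text, hidden_dim).
        text = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Patch-level image features: (B, L_img, hidden_dim).
        image = self.image_encoder(images)
        # Concatenate both sequences and let self-attention fuse them.
        fused = self.fusion(torch.cat([text, image], dim=1))
        # Mean-pool the fused sequence and classify into
        # positive / neutral / negative.
        return self.classifier(fused.mean(dim=1))
```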

Keywords: Multi-modal; Sentiment Recognition; Feature Enhancement
Received: 2022-09-26      Published: 2023-03-22
CLC Number: G350, TP391
Funding: National Natural Science Foundation of China (Grant No. 72174086); Graduate Research and Practice Innovation Program of Nanjing University of Aeronautics and Astronautics (xcxjh20220910)
Corresponding author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com
Cite this article:
Yang Ruyun, Ma Jing. A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT. Data Analysis and Knowledge Discovery, 2023, 7(11): 14-25.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1020      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I11/14
Fig.1  Framework of the feature-enhanced multi-modal emotion recognition model integrating knowledge and Res-ViT
Fig.2  Text feature extraction module
Fig.3  Structure of the Res-ViT model
Fig.4  Structure of the Transformer encoder
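Fig.3's Res-ViT is described in the abstract as an integration of ResNet and Vision Transformer. One plausible reading, sketched below entirely under our own assumptions, is a ResNet-50 backbone whose final feature map is projected and flattened into patch tokens and then refined by the two image Transformer layers listed in Table 2.

```python
# Illustrative Res-ViT sketch: a ResNet backbone produces a spatial
# feature map, which is projected and flattened into patch tokens and
# refined by ViT-style Transformer layers. The layer count follows
# Table 2; all other choices here are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResViT(nn.Module):
    def __init__(self, hidden_dim=768, num_layers=2):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")
        # Keep the convolutional stages; drop global pooling and FC head.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Learnable position embeddings for the 7x7 = 49 tokens that a
        # 224x224 input yields after ResNet-50 downsampling.
        self.pos_embed = nn.Parameter(torch.zeros(1, 49, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)

    def forward(self, images):                    # (B, 3, 224, 224)
        feat = self.proj(self.cnn(images))        # (B, hidden_dim, 7, 7)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, 49, hidden_dim)
        return self.transformer(tokens + self.pos_embed)
```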
Sentiment Category   Training Set   Validation Set   Test Set   Total
Positive             9,053          1,131            1,131      11,315
Neutral              3,528          440              440        4,408
Negative             1,040          129              129        1,298
Total                13,621         1,700            1,700      17,021
Table 1  Dataset distribution for a single experiment run
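Table 1's per-class counts correspond to a roughly 8:1:1 stratified split of MVSA-Multiple. The snippet below is a hedged sketch of such a split using scikit-learn with placeholder labels; the authors' actual splitting code and random seed are not given, so the resulting counts come out close to, but not exactly, those in the table.

```python
# Hedged sketch of a stratified, roughly 8:1:1 split over the class
# totals in Table 1; seed and splitting procedure are assumptions.
from sklearn.model_selection import train_test_split

# Placeholder indices and labels standing in for MVSA-Multiple items.
labels = ["positive"] * 11315 + ["neutral"] * 4408 + ["negative"] * 1298
indices = list(range(len(labels)))

train_idx, rest_idx, train_y, rest_y = train_test_split(
    indices, labels, test_size=0.2, stratify=labels, random_state=42)
val_idx, test_idx, val_y, test_y = train_test_split(
    rest_idx, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
print(len(train_idx), len(val_idx), len(test_idx))  # close to Table 1
```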
Parameter                    Value
Batch Size                   32
Learning Rate                3e-5
Warmup Rate                  0.1
Optimizer                    AdamW
Dropout                      0.1
Loss Function                Cross-Entropy Loss
Image Transformer Layers     2
Fusion Transformer Layers    4
Table 2  Model parameter settings
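Table 2's settings map directly onto a standard fine-tuning loop. The sketch below wires them up with AdamW, a 10% linear warmup schedule, and cross-entropy loss; `model` and `train_loader` are assumed to exist (e.g., the model sketch above with a batch size of 32), and the epoch count is illustrative, since Table 2 does not list one.

```python
# Sketch wiring Table 2's hyperparameters into a fine-tuning loop:
# AdamW, lr 3e-5, 10% linear warmup, cross-entropy loss. `model` and
# `train_loader` are assumed to exist; dropout 0.1 lives inside the
# model and batch size 32 inside the DataLoader.
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-5)
num_epochs = 10  # assumption: not reported in Table 2
total_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup rate 0.1
    num_training_steps=total_steps)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(num_epochs):
    for input_ids, attention_mask, images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(input_ids, attention_mask, images), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()
```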
Method      Modality      Model              Accuracy/%   F1/%
Baselines   Text          BERT               67.95        65.84
                          RoBERTa            68.21        65.72
                          RoBERTa-Enhanced   68.97        66.95
            Image         ResNet             66.40        61.17
                          ViT                66.05        60.90
                          Res-ViT            67.18        62.43
            Text+Image    RoBERTa-ResNet-E   69.44        66.98
                          RoBERTa-ResNet-L   66.57        63.75
                          MultiSentiNet      68.86        68.11
                          HSAN               67.96        67.76
                          Co-Mem             68.92        68.83
Ours        Text+Image    RERV-Concat        69.76        67.50
                          RERV-Transformer   71.66        69.42
                          RERV-LXMERT        70.63        68.36
                          RERV-MulT          70.42        68.14
Table 3  Comparison of model performance
Model                           Accuracy/%   F1/%
RoBERTa                         68.21        65.72
ResNet                          66.12        59.50
+ Fusion Transformer            70.35        67.59
+ Text Knowledge Enhancement    71.02        68.66
+ Image Transformer             71.66        69.42
Table 4  Results of the ablation experiments
Fig.5  Comparison of numbers of Transformer encoder layers
Fig.6  Examples of visualization analysis
[1] Zhao J, Gui X, Zhang X. Deep Convolution Neural Networks for Twitter Sentiment Analysis[J]. IEEE Access, 2018, 6: 23253-23260. DOI: 10.1109/ACCESS.2017.2776930.
[2] Rehman A U, Malik A K, Raza B, et al. A Hybrid CNN-LSTM Model for Improving Accuracy of Movie Reviews Sentiment Analysis[J]. Multimedia Tools and Applications, 2019, 78(18): 26597-26613. DOI: 10.1007/s11042-019-07788-7.
[3] Basiri M E, Nemati S, Abdar M, et al. ABCDM: An Attention-Based Bidirectional CNN-RNN Deep Model for Sentiment Analysis[J]. Future Generation Computer Systems, 2021, 115: 279-294. DOI: 10.1016/j.future.2020.08.005.
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. ACM, 2017: 5998-6008.
[5] Brown T B, Mann B, Ryder N, et al. Language Models are Few-Shot Learners[OL]. arXiv Preprint, arXiv:2005.14165.
[6] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[7] Sun Y, Wang S, Li Y, et al. ERNIE: Enhanced Representation Through Knowledge Integration[OL]. arXiv Preprint, arXiv:1904.09223.
[8] Liu W, Zhou P, Zhao Z, et al. K-BERT: Enabling Language Representation with Knowledge Graph[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 2901-2908.
[9] Ke P, Ji H, Liu S, et al. SentiLARE: Sentiment-aware Language Representation Learning with Linguistic Knowledge[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 6975-6988.
[10] Zhong P, Wang D, Miao C. Knowledge-enriched Transformer for Emotion Detection in Textual Conversations[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 165-176.
[11] Tian H, Gao C, Xiao X, et al. SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 4067-4076.
[12] Tenney I, Xia P, Chen B, et al. What do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations[C]// Proceedings of the 7th International Conference on Learning Representations. 2019.
[13] Roberts A, Raffel C, Shazeer N. How Much Knowledge Can You Pack into the Parameters of a Language Model?[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020: 5418-5426.
[14] Jia J, Wu S, Wang X, et al. Can We Understand Van Gogh’s Mood? Learning to Infer Affects from Images in Social Networks[C]// Proceedings of the 20th ACM International Conference on Multimedia. 2012: 857-860.
[15] Borth D, Ji R, Chen T, et al. Large-Scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs[C]// Proceedings of the 21st ACM International Conference on Multimedia. 2013: 223-232.
[16] Xu C, Cetintas S, Lee K C, et al. Visual Sentiment Prediction with Deep Convolutional Neural Networks[OL]. arXiv Preprint, arXiv:1411.5731.
[17] Campos V, Jou B, Giró-i-Nieto X. From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction[J]. Image and Vision Computing, 2017, 65: 15-22. DOI: 10.1016/j.imavis.2017.01.011.
[18] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[19] Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141.
[20] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional Block Attention Module[C]// Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
[21] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[OL]. arXiv Preprint, arXiv:2010.11929.
[22] Raghu M, Unterthiner T, Kornblith S, et al. Do Vision Transformers See Like Convolutional Neural Networks?[J]. Advances in Neural Information Processing Systems, 2021, 34: 12116-12128.
[23] Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv:1907.11692.
[24] Niu T, Zhu S, Pang L, et al. Sentiment Analysis on Multi-View Social Data[C]// Proceedings of the International Conference on Multimedia Modeling. Springer, Cham, 2016: 15-27.
[25] Xu N, Mao W. MultiSentiNet: A Deep Semantic Network for Multimodal Sentiment Analysis[C]// Proceedings of the 2017 ACM Conference on Information and Knowledge Management. 2017: 2399-2402.
[26] Vadicamo L, Carrara F, Cimino A, et al. Cross-Media Learning for Image Sentiment Analysis in the Wild[C]// Proceedings of the IEEE International Conference on Computer Vision Workshops. 2017: 308-317.
[27] Xu N. Analyzing Multimodal Public Sentiment Based on Hierarchical Semantic Attentional Network[C]// Proceedings of the IEEE International Conference on Intelligence and Security Informatics. 2017: 152-154.
[28] Xu N, Mao W, Chen G. A Co-Memory Network for Multimodal Sentiment Analysis[C]// Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 2018: 929-932.
[29] Li L, Yatskar M, Yin D, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language[OL]. arXiv Preprint, arXiv:1908.03557.
[30] Tan H, Bansal M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 5100-5111.
[31] Tsai Y H, Bai S, Liang P P, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 6558-6569.