Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (11): 14-25    DOI: 10.11925/infotech.2096-3467.2022.1020
A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT
Yang Ruyun, Ma Jing
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract  

[Objective] This paper aims to enhance the quality of multi-modal feature extraction and improve the accuracy of sentiment recognition for netizens' multi-modal public opinion posts. [Methods] First, we extracted text-modality features with RoBERTa and enhanced them with a knowledge phrase representation dictionary. Then, we proposed a Res-ViT model for the image modality, combining ResNet and Vision Transformer. Finally, we fused the multi-modal features with Transformer encoders and fed the fused representations to a fully connected layer for sentiment recognition. [Results] We evaluated the model on the MVSA-Multiple dataset and achieved an accuracy of 71.66% and an F1 score of 69.42%, improvements of 2.22 and 0.59 percentage points over the best baseline results. [Limitations] Further experiments on other datasets are needed to verify the model's generalizability and robustness. [Conclusions] The proposed model extracts and fuses multi-modal features more effectively and improves the accuracy of sentiment recognition.
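As a reading aid, the following is a minimal PyTorch sketch of how the pipeline described above could be assembled, assuming a Hugging Face RoBERTa checkpoint and a standard nn.TransformerEncoder for fusion; the class name, dimensions, and first-token pooling are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
from transformers import RobertaModel

class MultimodalSentimentModel(nn.Module):
    # Illustrative sketch: RoBERTa text features, an external image encoder
    # (e.g. a Res-ViT-style module), Transformer-encoder fusion, and a fully
    # connected classifier over three sentiment classes.
    def __init__(self, image_encoder, hidden=768, fusion_layers=4, num_classes=3):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = image_encoder
        self.image_proj = nn.Linear(image_encoder.out_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           dropout=0.1, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=fusion_layers)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask, images):
        # Token-level text features; knowledge-phrase enhancement is assumed to
        # have been applied upstream when building input_ids.
        text = self.text_encoder(input_ids=input_ids,
                                 attention_mask=attention_mask).last_hidden_state
        img = self.image_proj(self.image_encoder(images))    # (B, patches, hidden)
        fused = self.fusion(torch.cat([text, img], dim=1))   # joint token sequence
        return self.classifier(fused[:, 0])                  # pool on the first token

The fusion depth of 4 and the image-token projection mirror the layer counts reported in the parameter settings table below; the pooling strategy is an assumption.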

Key words: Multi-modal; Sentiment Recognition; Feature Enhancement
Received: 26 September 2022      Published: 22 March 2023
CLC Number: G350; TP391
Fund: National Natural Science Foundation of China (72174086); Postgraduate Research & Practice Innovation Program of Nanjing University of Aeronautics and Astronautics (xcxjh20220910)
Corresponding Author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com.

Cite this article:

Yang Ruyun, Ma Jing. A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT. Data Analysis and Knowledge Discovery, 2023, 7(11): 14-25.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1020     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I11/14

Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT
Text Feature Extraction Module
Structure of Res-ViT
Structure of Transformer Encoder
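The structure captions above refer to figures that did not survive extraction. As a rough, hypothetical sketch of a ResNet/Vision Transformer hybrid consistent with the paper's description, a ResNet feature map can be flattened into patch tokens and passed through a two-layer Transformer encoder (matching the image-Transformer depth in the parameter table below); the exact wiring in the paper may differ.

import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResViT(nn.Module):
    # Hypothetical Res-ViT sketch: ResNet backbone -> patch tokens -> Transformer encoder.
    def __init__(self, out_dim=768, num_layers=2):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Linear(2048, out_dim)       # ResNet channels -> token dimension
        layer = nn.TransformerEncoderLayer(d_model=out_dim, nhead=8,
                                           dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_dim = out_dim

    def forward(self, images):                     # images: (B, 3, 224, 224)
        fmap = self.backbone(images)               # (B, 2048, 7, 7) feature map
        tokens = fmap.flatten(2).transpose(1, 2)   # (B, 49, 2048): spatial positions as tokens
        return self.encoder(self.proj(tokens))     # positional embeddings omitted for brevity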
Sentiment category    Training set    Validation set    Test set    Total
Positive              9,053           1,131             1,131       11,315
Neutral               3,528           440               440         4,408
Negative              1,040           129               129         1,298
Total                 13,621          1,700             1,700       17,021
Distribution of the Dataset for One Experiment
Parameter                      Value
Batch Size                     32
Learning Rate                  3e-5
Warmup Rate                    0.1
Optimizer                      AdamW
Dropout                        0.1
Loss Function                  CrossEntropy Loss
Image Transformer layers       2
Fusion Transformer layers      4
Parameter Settings
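A hedged sketch of a training loop matching these settings, using AdamW with 10% linear warmup and cross-entropy loss; model, train_loader, and num_epochs are placeholders rather than values reported in the paper.

from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Hyperparameters from the table; num_epochs is assumed, and batch_size is
# applied when constructing train_loader (not shown).
batch_size, lr, warmup_rate, num_epochs = 32, 3e-5, 0.1, 10
optimizer = AdamW(model.parameters(), lr=lr)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(warmup_rate * total_steps),   # 10% linear warmup
    num_training_steps=total_steps)
criterion = CrossEntropyLoss()

model.train()
for epoch in range(num_epochs):
    for input_ids, attention_mask, images, labels in train_loader:
        logits = model(input_ids, attention_mask, images)
        loss = criterion(logits, labels)                # cross-entropy objective
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()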
Method      Modality      Model               Accuracy/%    F1/%
Baselines   Text          BERT                67.95         65.84
                          RoBERTa             68.21         65.72
                          RoBERTa-Enhanced    68.97         66.95
            Image         ResNet              66.40         61.17
                          ViT                 66.05         60.90
                          Res-ViT             67.18         62.43
            Text+Image    RoBERTa-ResNet-E    69.44         66.98
                          RoBERTa-ResNet-L    66.57         63.75
                          MultiSentiNet       68.86         68.11
                          HSAN                67.96         67.76
                          Co-Mem              68.92         68.83
Proposed    Text+Image    RERV-Concat         69.76         67.50
                          RERV-Transformer    71.66         69.42
                          RERV-LXMERT         70.63         68.36
                          RERV-MulT           70.42         68.14
Comparison of Model Performance
Model                            Accuracy/%    F1/%
RoBERTa                          68.21         65.72
ResNet                           66.12         59.50
+ Fusion Transformer             70.35         67.59
+ Text knowledge enhancement     71.02         68.66
+ Image Transformer              71.66         69.42
Ablation Results of the Model
Comparison of Different Numbers of Transformer Layers
Example of Visualization Analysis