[Objective] This paper aims to enhance the quality of multi-modal feature extraction and improve the accuracy of netizen sentiment recognition for multi-modal public opinion. [Methods] First, we extracted text-modality features with RoBERTa and enhanced them with a knowledge phrase representation dictionary. Then, we proposed a Res-ViT model for the image modality, combining ResNet and Vision Transformer. Finally, we fused the multi-modal features with Transformer encoders and fed the fused representations to a fully connected layer for sentiment recognition. [Results] Evaluated on the MVSA-Multiple dataset, our model achieved an accuracy of 71.66% and an F1 score of 69.42% for sentiment recognition, improvements of 2.22% and 0.59% over the best scores of the baseline methods. [Limitations] Further experiments on other datasets are needed to verify the model's generalizability and robustness. [Conclusions] The proposed model can extract and fuse multi-modal features more effectively and improves the accuracy of sentiment recognition.
Yang Ruyun, Ma Jing. A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT. Data Analysis and Knowledge Discovery, 2023, 7(11): 14-25.
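To make the pipeline described in the Methods section concrete, the following is a minimal sketch in PyTorch with Hugging Face Transformers, assuming the `roberta-base` and `google/vit-base-patch16-224` checkpoints. The module names (`ResViTEncoder`, `MultimodalSentimentModel`), hidden sizes, number of fusion layers, and the way the ResNet and ViT features are combined are illustrative assumptions, not the authors' implementation; the knowledge phrase representation dictionary enhancement is omitted.

```python
# Illustrative sketch only: layer sizes, checkpoints, and fusion details are
# assumptions inferred from the abstract, not the paper's released code.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import RobertaModel, ViTModel


class ResViTEncoder(nn.Module):
    """Image branch: ResNet features concatenated with ViT features."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        cnn = resnet50(weights=None)
        # Drop the final FC layer; keep the globally pooled 2048-d feature.
        self.resnet = nn.Sequential(*list(cnn.children())[:-1])
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.proj = nn.Linear(2048 + self.vit.config.hidden_size, d_model)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        res_feat = self.resnet(pixel_values).flatten(1)                # (B, 2048)
        vit_feat = self.vit(pixel_values=pixel_values).pooler_output   # (B, 768)
        return self.proj(torch.cat([res_feat, vit_feat], dim=-1))      # (B, d_model)


class MultimodalSentimentModel(nn.Module):
    """Text (RoBERTa) + image (Res-ViT) features fused by a Transformer encoder."""

    def __init__(self, num_classes: int = 3, d_model: int = 768):
        # num_classes=3 assumes positive/neutral/negative labels as in MVSA.
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = ResViTEncoder(d_model)
        fusion_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                  batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                      # sentence-level token, (B, 768)
        img_feat = self.image_encoder(pixel_values)    # (B, 768)
        # Treat the two modality vectors as a length-2 sequence so self-attention
        # can mix them before classification.
        fused = self.fusion(torch.stack([text_feat, img_feat], dim=1))
        return self.classifier(fused.mean(dim=1))      # (B, num_classes)
```

In this sketch the two modality vectors form a two-token sequence for the fusion encoder; the paper's actual fusion may instead attend over full token and patch sequences, and its knowledge-enhanced text features would replace the plain RoBERTa output used here.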