A Feature-Enhanced Multi-modal Emotion Recognition Model Integrating Knowledge and Res-ViT

Yang Ruyun, Ma Jing
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract [Objective] This paper aims to enhance the quality of multi-modal feature extraction and improve the accuracy of netizen sentiment recognition for multi-modal public opinion. [Methods] First, we extracted text-modality features with RoBERTa and enhanced them with a knowledge phrase representation dictionary. Then, we proposed a Res-ViT model for the image modality, combining ResNet and Vision Transformer. Finally, we fused the multi-modal features with Transformer encoders and fed the fused representation to a fully connected layer for sentiment recognition. [Results] On the MVSA-Multiple dataset, our model achieved an accuracy of 71.66% and an F1 score of 69.42%, improving on the best baseline scores by 2.22% and 0.59%, respectively. [Limitations] Further experiments on other datasets are needed to verify the model's generalizability and robustness. [Conclusions] The proposed model extracts and fuses multi-modal features more effectively and improves the accuracy of sentiment recognition.
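The fusion step described above (Transformer encoders over combined text and image features, followed by a fully connected classifier) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, the single untrained attention pass, and the mean pooling are all assumptions standing in for the paper's full Transformer-encoder fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fuse(text_feats, image_feats):
    """Fuse token-level features from two modalities.

    Stacks both feature sequences into one joint sequence, applies one
    scaled dot-product self-attention pass (so every text token can attend
    to every image patch and vice versa), then mean-pools into a single
    joint representation for a downstream classifier.
    """
    seq = np.concatenate([text_feats, image_feats], axis=0)  # (T+V, d)
    d = seq.shape[1]
    scores = seq @ seq.T / np.sqrt(d)        # (T+V, T+V) cross-modal affinities
    fused = softmax(scores, axis=-1) @ seq   # attention-weighted mixture
    return fused.mean(axis=0)                # pooled joint representation (d,)

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 64))    # stand-in for 8 RoBERTa token vectors
image = rng.normal(size=(4, 64))   # stand-in for 4 Res-ViT patch vectors
rep = self_attention_fuse(text, image)
print(rep.shape)  # (64,)
```

In the paper's setting, `rep` would then pass through a fully connected layer with a softmax over the sentiment classes; a trained multi-head Transformer encoder would replace the single untrained attention pass shown here.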
Received: 26 September 2022
Published: 22 March 2023
Fund: National Natural Science Foundation of China (72174086); Postgraduate Research & Practice Innovation Program of Nanjing University of Aeronautics and Astronautics (xcxjh20220910)
Corresponding Author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com.