[Objective] This paper extracts users’ opinions from videos and analyzes their sentiment using multimodal methods. [Methods] First, we introduced bimodal and trimodal context information to capture the interactions among the text, visual, and audio modalities. Then, we used an attention mechanism to filter out redundant information. Finally, we conducted sentiment analysis on the processed data. [Results] We evaluated the proposed method on the MOSEI dataset. The accuracy and F1 score of sentiment classification reached 80.27% and 79.23%, which were 0.47% and 0.87% higher than the best results of the benchmark method. The mean absolute error of the regression analysis was reduced to 0.66. [Limitations] Overfitting occurred during model training because of the small size of the MOSI dataset, which limited the performance of sentiment prediction. [Conclusions] The proposed model exploits the interactions among different modalities and effectively improves the accuracy of sentiment prediction.
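The bimodal interaction and attention steps summarized in the abstract can be illustrated with a minimal sketch. It is not the model from the paper: it assumes a plain dot-product cross-modal attention over utterance-level features, and the function names (`softmax`, `bimodal_attention`), the feature sizes, and the random inputs are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def bimodal_attention(x, y):
    """Attend from modality x to modality y (assumed dot-product attention).

    x, y: arrays of shape (num_utterances, dim), one row per utterance.
    Returns the gated features and the attention weights.
    """
    # Similarity between every utterance in x and every utterance in y.
    weights = softmax(x @ y.T, axis=-1)   # (num_utterances, num_utterances)
    # Context from y that is relevant to each utterance in x.
    attended = weights @ y                # (num_utterances, dim)
    # Element-wise gating keeps only what agrees with x itself.
    return attended * x, weights


# Hypothetical inputs: 4 utterances, 8-dimensional features per modality.
rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))
audio = rng.standard_normal((4, 8))
visual = rng.standard_normal((4, 8))

fused_ta, w_ta = bimodal_attention(text, audio)
fused_tv, w_tv = bimodal_attention(text, visual)
fused_av, w_av = bimodal_attention(audio, visual)

# Concatenate the pairwise interactions into one feature vector per utterance.
features = np.concatenate([fused_ta, fused_tv, fused_av], axis=-1)
```

Each row of the attention weights sums to one, so utterances in the second modality that carry little relevant context receive small weights; this is one simple way to read "filtering redundant information". A trimodal interaction would need an additional fusion step that this sketch does not cover.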