|
|
Multimodal Sentiment Analysis Model Integrating Multi-features and Attention Mechanism |
Lyu Xueqiang1, Tian Chi1, Zhang Le1, Du Yifan1, Zhang Xu1, Cai Zangtai2 |
1Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining 810008, China |
|
|
Abstract [Objective] This paper proposes a multimodal sentiment analysis model integrating multiple features and attention mechanisms, addressing the insufficient extraction of multimodal features and the inadequate interaction of intra-modal and inter-modal information in existing models. [Methods] For the video modality, we enhanced feature extraction with the body movements, gender, and age of the individuals on screen. For the text modality, we integrated BERT-based character-level and word-level semantic vectors. These additions enriched the low-level features of the multimodal data. We then used self-attention and cross-modal attention mechanisms to integrate intra-modal and inter-modal information, concatenated the modal features, and employed a soft attention mechanism to allocate an attention weight to each feature. Finally, we generated the sentiment classification results through fully connected layers. [Results] We evaluated the proposed model on the public CH-SIMS dataset and on the Hot Public Opinion Comments Videos (HPOC) dataset constructed in this paper. Compared with the Self-MM model, our model improved binary classification accuracy, tri-class classification accuracy, and F1 score by 1.83%, 1.74%, and 0.69% on CH-SIMS, and by 1.03%, 0.94%, and 0.79% on HPOC. [Limitations] The scene around a person in a video may change constantly, and different scenes may carry different emotional information; our model does not integrate this scene information. [Conclusions] The proposed model enriches the low-level features of multimodal data and improves the effectiveness of sentiment analysis.
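The fusion pipeline summarized above (cross-modal attention between modalities, concatenation of the resulting streams, and a soft attention mechanism weighting each stream before classification) might be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature dimensions, sequence lengths, and the scoring vector `w` are all assumptions introduced for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_mod, kv_mod, d_k):
    # Scaled dot-product attention: queries from one modality,
    # keys/values from another (cross-modal interaction)
    scores = query_mod @ kv_mod.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ kv_mod

rng = np.random.default_rng(0)
d = 8                               # shared feature dimension (assumed)
text  = rng.normal(size=(5, d))     # 5 text tokens (stand-in features)
video = rng.normal(size=(7, d))     # 7 video frames
audio = rng.normal(size=(6, d))     # 6 audio frames

# Text attends to video and audio; mean-pool each stream to one vector
t2v = cross_modal_attention(text, video, d).mean(axis=0)
t2a = cross_modal_attention(text, audio, d).mean(axis=0)
t   = text.mean(axis=0)

# Concatenate the streams, then soft attention assigns one weight per stream
streams = np.stack([t, t2v, t2a])        # (3, d)
w = rng.normal(size=(d,))                # hypothetical learned scoring vector
alpha = softmax(streams @ w)             # (3,) soft attention weights, sum to 1
fused = (alpha[:, None] * streams).sum(axis=0)  # weighted fusion, shape (d,)
```

In the full model, `fused` would pass through fully connected layers to produce the sentiment classes, and the pooled vectors would come from trained encoders (e.g. BERT for text) rather than random features.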
|
Received: 11 January 2023
Published: 27 May 2024
|
|
Fund: Key Program of the National Language Commission of China (ZDI145-10); Science and Technology Plan of the Beijing Municipal Commission of Education (KM202311232001); Innovation Platform Construction Special Project of Qinghai Province (2022-ZJ-T02) |
Corresponding Author:
Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn.
|
[1] |
Gönen M, Alpaydin E. Multiple Kernel Learning Algorithms[J]. Journal of Machine Learning Research, 2011, 12: 2211-2268.
|
[2] |
Ghahramani Z. Learning Dynamic Bayesian Networks[M]//Adaptive Processing of Sequences and Data Structures. Berlin, Heidelberg: Springer, 1998: 168-197.
|
[3] |
Chen Y. Convolutional Neural Network for Sentence Classification[D]. Waterloo: University of Waterloo, 2015.
|
[4] |
Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
doi: 10.1162/neco.1997.9.8.1735
pmid: 9377276
|
[5] |
Tang D Y, Qin B, Liu T. Document Modeling with Gated Recurrent Neural Network for Sentiment Classification[C]// Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015: 1422-1432.
|
[6] |
Poria S, Cambria E, Hazarika D, et al. Context-Dependent Sentiment Analysis in User-Generated Videos[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 873-883.
|
[7] |
Williams J, Kleinegesse S, Comanescu R, et al. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion[C]// Proceedings of Grand Challenge and Workshop on Human Multimodal Language. 2018: 11-19.
|
[8] |
Zadeh A, Chen M H, Poria S, et al. Tensor Fusion Network for Multimodal Sentiment Analysis[C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017: 1103-1114.
|
[9] |
Hou M, Tang J J, Zhang J H, et al. Deep Multimodal Multilinear Fusion with High-Order Polynomial Pooling[C]// Proceedings of the 33rd Conference on Neural Information Processing Systems. 2019: 12156-12166.
|
[10] |
Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers). 2018: 2247-2256.
|
[11] |
Ma Chao, Li Gang, Chen Sijing, et al. Research on Usefulness Recognition of Tourism Online Reviews Based on Multimodal Data Semantic Fusion[J]. Journal of the China Society for Scientific and Technical Information, 2020, 39(2): 199-207. (in Chinese)
|
[12] |
Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
|
[13] |
Tsai Y H H, Bai S J, Liang P P, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 6558-6569.
|
[14] |
Hazarika D, Zimmermann R, Poria S. MISA: Modality-Invariant and-Specific Representations for Multimodal Sentiment Analysis[C]// Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1122-1131.
|
[15] |
Zhang Yu, Zhang Haijun, Liu Yaqing, et al. Multimodal Sentiment Analysis Based on Bidirectional Mask Attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2023, 7(4): 46-55. (in Chinese)
|
[16] |
Lindsay P H, Norman D A. Human Information Processing: An Introduction to Psychology[M]. London: Academic Press, 2013.
|
[17] |
Zhang Feng, Li Xicheng, Dong Chunru, et al. Deep Emotional Arousal Network for Multimodal Sentiment Analysis and Emotion Recognition[J]. Control and Decision, 2022, 37(11): 2984-2992. (in Chinese)
|
[18] |
Wang Xuyang, Dong Shuai, Shi Jie. Multimodal Sentiment Analysis with Composite Hierarchical Fusion[J]. Journal of Frontiers of Computer Science & Technology, 2023, 17(1): 198-208. (in Chinese)
doi: 10.3778/j.issn.1673-9418.2111004
|
[19] |
McFee B, Raffel C, Liang D W, et al. librosa: Audio and Music Signal Analysis in Python[C]// Proceedings of the 14th Python in Science Conference. 2015: 18-25.
|
[20] |
Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies, Volume1(Long and Short Papers). 2019: 4171-4186.
|
[21] |
Niu Y L, Xie R B, Liu Z Y, et al. Improved Word Representation Learning with Sememes[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers). 2017: 2049-2058.
|
[22] |
Yu W M, Xu H, Meng F Y, et al. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-Grained Annotation of Modality[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 3718-3727.
|
[23] |
Baltrusaitis T, Zadeh A, Lim Y C, et al. OpenFace 2.0: Facial Behavior Analysis Toolkit[C]// Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition. 2018: 59-66.
|
[24] |
Xu Y F, Zhang J, Zhang Q M, et al. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation[OL]. arXiv Preprint, arXiv:2204.12484.
|
[25] |
Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common Objects in Context[C]// Proceedings of the 13th European Conference on Computer Vision. 2014: 740-755.
|
[26] |
Park S, Shim H S, Chatterjee M, et al. Computational Analysis of Persuasiveness in Social Multimedia: A Novel Dataset and Multimodal Prediction Approach[C]// Proceedings of the 16th ACM International Conference on Multimodal Interaction. 2014: 50-57.
|
[27] |
Zeng Y, Mai S, Hu H F. Which is Making the Contribution: Modulating Unimodal and Cross-Modal Dynamics for Multimodal Sentiment Analysis[C]// Findings of the Association for Computational Linguistics:EMNLP 2021. 2021: 1262-1274.
|
[28] |
Zadeh A, Liang P P, Mazumder N, et al. Memory Fusion Network for Multi-view Sequential Learning[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 5634-5641.
|
[29] |
Yu W M, Xu H, Yuan Z Q, et al. Learning Modality-Specific Representations with Self-Supervised Multi-task Learning for Multimodal Sentiment Analysis[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021: 10790-10797.
|
|