[Objective] This paper develops a multi-task, multi-modal sentiment analysis model based on aware fusion, aiming to make full use of contextual information and to address the modality-invariant and modality-specific representation problem. [Methods] We established four sentiment analysis tasks: multi-modal, text, acoustic, and visual. We extracted their features with the BERT, wav2vec 2.0, and OpenFace 2.0 models, processed them with a self-attention layer, and sent them to the aware fusion layer for multi-modal feature fusion. Finally, we classified the single-modal and multi-modal representations with Softmax. We also introduced a loss function based on homoscedastic uncertainty to assign weights to the different tasks automatically. [Results] Compared with the baseline method, the proposed model improved accuracy and F1 score by 1.59% and 1.67% on CH-SIMS, and by 0.55% and 0.67% on CMU-MOSI. The ablation experiment showed that the accuracy and F1 score of multi-task learning were 4.08% and 4.18% higher than those of single-task learning. [Limitations] The model's performance on large-scale datasets remains to be examined. [Conclusions] The model can effectively reduce noise and improve multi-modal fusion, and the multi-task learning framework achieves better performance.
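The automatic task weighting named in [Methods] follows the homoscedastic-uncertainty scheme of Kendall et al. (reference [21]): each task loss L_i is scaled by a learned precision exp(-s_i), where s_i = log(sigma_i^2), plus a regularizing s_i term. Below is a minimal PyTorch sketch of that weighting under the paper's four-task setup (multi-modal, text, acoustic, visual); the class name UncertaintyWeightedLoss and all tensor shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty task weighting (Kendall et al., 2018).

    Learns one log-variance s_i per task and combines per-task losses as
    sum_i exp(-s_i) * L_i + s_i, so noisier tasks are down-weighted
    automatically instead of being weighted by hand.
    """

    def __init__(self, num_tasks: int):
        super().__init__()
        # s_i = log(sigma_i^2); initialised to 0, i.e. sigma_i = 1.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])  # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total


# Toy usage with four classification heads (multi-modal, text, acoustic,
# visual), batch of 8, 3 sentiment classes -- placeholder shapes, not the
# paper's real dimensions.
criterion = nn.CrossEntropyLoss()
weighting = UncertaintyWeightedLoss(num_tasks=4)
logits = [torch.randn(8, 3, requires_grad=True) for _ in range(4)]
labels = torch.randint(0, 3, (8,))
losses = [criterion(out, labels) for out in logits]
total_loss = weighting(losses)
total_loss.backward()  # gradients flow to both the heads and log_vars
```

Parameterising the weights as log-variances rather than sigma_i directly keeps each implied weight positive and numerically stable, which is the usual way reference [21] is implemented in practice.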
Wu Sisi, Ma Jing. Multi-task & Multi-modal Sentiment Analysis Model Based on Aware Fusion[J]. Data Analysis and Knowledge Discovery, 2023, 7(10): 74-84.
[1] Cambria E, Hazarika D, Poria S, et al. Benchmarking Multimodal Sentiment Analysis[C]// Proceedings of International Conference on Computational Linguistics and Intelligent Text Processing. Springer, Cham, 2017: 166-179.
[2] Bahdanau D, Cho K H, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[C]// Proceedings of the 3rd International Conference on Learning Representations. 2015.
[3] Zadeh A, Chen M, Poria S, et al. Tensor Fusion Network for Multimodal Sentiment Analysis[OL]. arXiv Preprint, arXiv: 1707.07250.
[4] Gu Y, Yang K N, Fu S Y, et al. Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 2225-2235.
[5] Wang Y S, Shen Y, Liu Z, et al. Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence and 31st Innovative Applications of Artificial Intelligence Conference and 9th AAAI Symposium on Educational Advances in Artificial Intelligence. 2019: 7216-7223.
[6] Pham H, Liang P P, Manzini T, et al. Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities[C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence and the 31st Innovative Applications of Artificial Intelligence Conference and 9th AAAI Symposium on Educational Advances in Artificial Intelligence. 2019: 6892-6899.
[7] Hazarika D, Zimmermann R, Poria S. MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis[C]// Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1122-1131.
[8] Pan Jiahui, He Zhipeng, Li Zina, et al. A Review of Multimodal Emotion Recognition[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 633-645.
[9] Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient Low-Rank Multimodal Fusion with Modality-Specific Factors[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 2247-2256.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[11] Zadeh A, Liang P P, Mazumder N, et al. Memory Fusion Network for Multi-view Sequential Learning[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence. 2018: 5634-5641.
[12] Tsai Y H H, Bai S J, Liang P P, et al. Multimodal Transformer for Unaligned Multimodal Language Sequences[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 6558-6569.
[13] Sahay S, Okur E, Kumar S H, et al. Low Rank Fusion Based Transformers for Multimodal Sequences[C]// Proceedings of the 2nd Grand-Challenge and Workshop on Multimodal Language (Challenge-HML). 2020: 29-34.
[14] Han W, Chen H, Poria S. Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis[C]// Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021: 9180-9192.
[15] Li Z, Xu B, Zhu C H, et al. CLMLF: A Contrastive Learning and Multi-Layer Fusion Method for Multimodal Sentiment Detection[OL]. arXiv Preprint, arXiv: 2204.05515.
[16] Akhtar M S, Chauhan D S, Ghosal D, et al. Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis[OL]. arXiv Preprint, arXiv: 1905.05812.
[17] Yu W M, Xu H, Meng F Y, et al. CH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 3718-3727.
[18] Chauhan D S, Dhanush S R, Ekbal A, et al. Sentiment and Emotion Help Sarcasm? A Multi-task Learning Framework for Multi-modal Sarcasm, Sentiment and Emotion Analysis[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 4351-4360.
[19] Yu W M, Xu H, Yuan Z Q, et al. Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2021: 10790-10797.
[20] Yang B S, Li J, Wong D F, et al. Context-Aware Self-Attention Networks[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2019: 387-394.
[21] Kendall A, Gal Y, Cipolla R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 7482-7491.
[22] Cui Y M, Che W X, Liu T, et al. Pre-training with Whole Word Masking for Chinese BERT[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 3504-3514. DOI: 10.1109/TASLP.2021.3124365.
[23] Baevski A, Zhou H, Mohamed A, et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020: 12449-12460.
[24] Baltrusaitis T, Zadeh A, Lim Y C, et al. OpenFace 2.0: Facial Behavior Analysis Toolkit[C]// Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition. 2018: 59-66.
[25] Lu J S, Batra D, Parikh D, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks[OL]. arXiv Preprint, arXiv: 1908.02265.
[26] Zadeh A, Zellers R, Pincus E, et al. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos[OL]. arXiv Preprint, arXiv: 1606.06259.