[Objective] This paper designs an SC-Attention fusion mechanism to address the low prediction accuracy and the difficulty of fusing multimodal features in existing multimodal sarcasm detection models. [Methods] First, we used the CLIP and RoBERTa models to extract features from images, image attributes, and texts. Second, we combined the Co-Attention mechanism with SENet's attention mechanism to establish the SC-Attention mechanism and fuse the multimodal features. Third, we re-allocated the attention feature weights according to the original modalities. Finally, we fed the fused features into fully connected layers to detect sarcasm. [Results] The accuracy and F1 score of the proposed model reached 93.71% and 91.68%, which were 10.27 and 11.5 percentage points higher than those of the existing models. [Limitations] The model needs to be examined on more data sets. [Conclusions] The proposed model reduces information redundancy and feature loss, effectively improving the accuracy of multimodal sarcasm detection.
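The fusion pipeline described above can be sketched roughly in PyTorch. This is a minimal illustration under assumptions (the head count, reduction ratio, class names, and the exact point at which the excitation weights are applied are all hypothetical), not the authors' implementation:

```python
import torch
import torch.nn as nn


class SCAttentionFusion(nn.Module):
    """Hypothetical sketch of an SC-Attention-style fusion block:
    co-attention across two modalities followed by an SENet-style
    squeeze-and-excitation re-weighting and a fully connected classifier."""

    def __init__(self, dim=768, reduction=16):
        super().__init__()
        # Co-attention between modalities (8 heads is an assumption)
        self.co_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # SENet-style squeeze-and-excitation over the feature channels
        self.se = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(dim, 2)  # sarcastic / non-sarcastic

    def forward(self, text_feats, image_feats):
        # Co-attention: text tokens attend over image features
        fused, _ = self.co_attn(text_feats, image_feats, image_feats)
        # Squeeze: global average pooling over the sequence dimension
        squeezed = fused.mean(dim=1)
        # Excitation: per-channel weights re-scale the pooled fused features
        weights = self.se(squeezed)
        reweighted = squeezed * weights
        return self.classifier(reweighted)
```

A usage example: with 768-dimensional text and image feature sequences, the block returns two-class logits per sample.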
Fig.3 An example with the image-attribute modality added (image attributes: blue, sky, white, cloud, many; text: "What bad weather!")
Fig.4 Structure of the CLIP model's text encoder
Fig.5 Structure of the SC-Attention mechanism
Fig.6 Structure of the Parallel-Attention mechanism
Fig.7 Structure of SENet's attention mechanism
Class         | Training set | Test set
Non-sarcastic | 8,642        | 1,918
Sarcastic     | 11,174       | 2,901
Total         | 19,816       | 4,819
Table 1 Dataset annotation statistics
Parameter                | Value
Word vector dimension    | 768
Image vector dimension   | 768
Dropout                  | 0.15
Learning rate            | 0.0001
Batch size               | 32
Optimizer                | Adam
Loss function            | Cross-entropy loss
Table 2 Experimental parameter settings
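The settings in Table 2 can be wired up in PyTorch roughly as follows; the two-layer classifier here is a placeholder standing in for the full model, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Placeholder model with Dropout 0.15 on 768-dim input features (Table 2)
model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.15),
    nn.Linear(256, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 0.0001
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss

# One training step on a batch of 32 (Table 2's batch size)
batch = torch.randn(32, 768)
labels = torch.randint(0, 2, (32,))
loss = criterion(model(batch), labels)
loss.backward()
optimizer.step()
```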
Model   | ViT | RN50 | RoBERTa | BERT | CLIP text encoder | Co-Attention | SC
VRC-FC  |  √  |      |    √    |      |         √         |      √       |
VRC-CF  |  √  |      |    √    |      |         √         |              |
VRC-TFN |  √  |      |    √    |      |         √         |              |
VRC-SC  |  √  |      |    √    |      |         √         |      √       |  √
RRC-FC  |     |  √   |    √    |      |         √         |      √       |
RRC-CF  |     |  √   |    √    |      |         √         |              |
RRC-TFN |     |  √   |    √    |      |         √         |              |
RRC-SC  |     |  √   |    √    |      |         √         |      √       |  √
VBC-SC  |  √  |      |         |  √   |         √         |      √       |  √
Table 3 Comparison experiment settings
Model              | Acc/% | P/%   | R/%   | F1/%
Text(BERT)         | 80.40 | 75.67 | 77.80 | 76.72
Text(RoBERTa)      | 90.11 | 87.16 | 89.67 | 88.40
Image(ViT)         | 61.85 | 54.06 | 54.01 | 54.03
Image(ResNet50×16) | 59.56 | 52.62 | 50.32 | 51.44
Attribute          | 78.87 | 74.69 | 74.26 | 74.48
Concat(I+T)        | 91.93 | 89.02 | 90.11 | 89.56
Concat(I+A)        | 78.91 | 74.02 | 75.81 | 74.90
Concat(A+T)        | 91.23 | 88.76 | 89.34 | 89.05
VRC-SC(I+T+A)      | 93.71 | 90.28 | 93.13 | 91.68
Table 4 Results of the modality ablation experiments
Model                     | Acc/% | P/%   | R/%   | F1/%
VRC-FC                    | 91.33 | 89.17 | 92.04 | 90.23
VRC-CF                    | 90.14 | 87.43 | 88.72 | 88.07
VRC-TFN                   | 91.98 | 89.45 | 91.68 | 90.55
VRC-SC                    | 93.71 | 90.28 | 93.13 | 91.68
RRC-FC                    | 91.11 | 88.15 | 91.64 | 89.48
RRC-CF                    | 89.75 | 87.23 | 86.89 | 87.06
RRC-TFN                   | 91.33 | 88.14 | 90.07 | 89.09
RRC-SC                    | 92.42 | 89.48 | 92.66 | 90.56
Hierarchical fusion model | 83.44 | 76.57 | 84.15 | 80.18
Table 5 Experimental results of different feature fusion mechanisms
Model  | Acc/% | P/%   | R/%   | F1/%
VBC-SC | 83.83 | 77.45 | 83.46 | 80.34
VRC-SC | 93.71 | 90.28 | 93.13 | 91.68
RRC-SC | 92.42 | 89.48 | 92.66 | 90.56
Table 6 Experimental results of different feature extraction models
Fig.8 Accuracy comparison under different Dropout values
Training scheme | Optimizer | LR decay strategy                 | Epochs | Acc/%
Scheme 1        | SGD       | Manual adjustment in later stages | 129    | 93.42
Scheme 2        | Adam      | Cosine annealing                  | 112    | 93.71
Scheme 3        | RMSprop   | Adaptive LR adjustment            | 139    | 93.35
Table 7 Experimental results of the three optimizers and learning-rate decay strategies
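Scheme 2 (Adam with cosine annealing, the best-performing configuration) can be reproduced in outline with PyTorch's built-in scheduler; the single parameter tensor below is a stand-in for a real model:

```python
import torch

# Stand-in parameters; a real run would pass model.parameters()
params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.Adam(params, lr=1e-4)  # base LR from Table 2
# Cosine annealing decays the LR from 1e-4 toward 0 over 112 epochs (Table 7)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=112)

lrs = []
for epoch in range(112):
    optimizer.step()   # one (dummy) optimization step per epoch
    scheduler.step()   # decay the learning rate along the cosine curve
    lrs.append(optimizer.param_groups[0]["lr"])
```

By the final epoch the learning rate has annealed to essentially zero, which matches the smooth late-stage decay the cosine schedule is chosen for.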
Fig.9 An example of a sarcastic tweet