[Objective] This paper proposes a new method that combines images and textual descriptions to improve the classification of Intangible Cultural Heritage (ICH) images. [Methods] We built a multimodal fusion model comprising a fine-tuned deep pre-trained model for extracting visual semantic features, a BERT model for extracting textual features, a fusion layer that concatenates the visual and textual features, and an output layer that predicts labels. [Results] We evaluated the proposed model on the national ICH project of New Year Prints, classifying Mianzhu, Taohuawu, Yangjiabu, and Yangliuqing Prints. Fine-tuning the convolutional layers strengthened the visual semantic features of the ICH images, raising the F1 score to 72.028%; compared with the baseline models, our full method yielded the best results, with an F1 score of 77.574%. [Limitations] The proposed model was tested only on New Year Prints and needs to be extended to more ICH projects in the future. [Conclusions] Adding textual description features improves the performance of ICH image classification, and fine-tuning the convolutional layers of the deep pre-trained image model improves the extraction of visual semantic features.
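The architecture described above (visual extractor, textual extractor, concatenation fusion, output layer) can be sketched as follows. This is a minimal illustration only: the stub `visual_features` and `text_features` functions, the feature dimensions, and the random linear output layer are hypothetical stand-ins for the paper's fine-tuned pre-trained CNN and BERT components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: in the paper, visual features come from a
# fine-tuned deep pre-trained CNN and textual features from BERT
# (whose pooled output is 768-d); 512 for the visual side is assumed.
VIS_DIM, TXT_DIM, N_CLASSES = 512, 768, 4  # 4 New Year Print schools

def visual_features(image: np.ndarray) -> np.ndarray:
    """Stub standing in for the fine-tuned CNN feature extractor."""
    return np.resize(image.ravel(), VIS_DIM)

def text_features(tokens: list) -> np.ndarray:
    """Stub standing in for the BERT text encoder (bag-of-hashes here)."""
    vec = np.zeros(TXT_DIM)
    for tok in tokens:
        vec[hash(tok) % TXT_DIM] += 1.0
    return vec

def fuse_and_classify(image, tokens, W, b):
    """Concatenation fusion followed by a linear output layer + softmax."""
    fused = np.concatenate([visual_features(image), text_features(tokens)])
    logits = W @ fused + b
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()                    # probabilities over 4 classes

# Untrained illustrative parameters for the output layer.
W = rng.normal(scale=0.01, size=(N_CLASSES, VIS_DIM + TXT_DIM))
b = np.zeros(N_CLASSES)

probs = fuse_and_classify(rng.normal(size=(32, 32)),
                          ["new", "year", "print"], W, b)
```

In the actual model the two extractors and the output layer would be trained jointly; the point of the sketch is only the data flow of early (feature-level) fusion by concatenation.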
Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong. Classifying Images of Intangible Cultural Heritages with Multimodal Fusion. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 329-337.