Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue (2/3): 329-337     https://doi.org/10.11925/infotech.2096-3467.2021.0911
Classifying Images of Intangible Cultural Heritages with Multimodal Fusion
Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210023, China

Abstract

[Objective] This paper proposes a new method combining images and textual descriptions, aiming to improve the classification of Intangible Cultural Heritage (ICH) images. [Methods] We built a new model with multimodal fusion, which includes a fine-tuned deep pre-trained model for extracting visual semantic features, a BERT model for extracting textual features, a fusion layer for concatenating visual and textual features, and an output layer for predicting labels. [Results] We examined the proposed model on the national ICH project of New Year Prints, classifying the Mianzhu, Taohuawu, Yangjiabu, and Yangliuqing prints. We found that fine-tuning the convolutional layers strengthened the visual semantic features of the ICH images, with the F1 value for classification reaching 72.028%. Compared with the baseline models, our method yielded the best results, with an F1 value of 77.574%. [Limitations] The proposed model was only tested on New Year Prints and needs to be validated on more ICH projects in the future. [Conclusions] Adding textual description features can improve the performance of ICH image classification, and fine-tuning the convolutional layers of the deep pre-trained image model can improve the extracted visual semantic features.
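The fusion strategy described above (concatenating visual semantic features with textual features before an output layer) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 512-dimensional visual vector, the 768-dimensional BERT-style text vector, the four-class output, and the random classifier weights are all placeholders.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse_and_classify(visual_feat, text_feat, W, b):
    """Late fusion by concatenation, followed by a linear classifier."""
    fused = np.concatenate([visual_feat, text_feat])  # multimodal fusion layer
    return softmax(W @ fused + b)                     # class probabilities

rng = np.random.default_rng(0)
v = rng.standard_normal(512)                    # assumed visual features (e.g. from a fine-tuned CNN)
t = rng.standard_normal(768)                    # assumed textual features (e.g. a BERT [CLS] vector)
W = rng.standard_normal((4, 512 + 768)) * 0.01  # 4 New Year Print classes
b = np.zeros(4)
probs = fuse_and_classify(v, t, W, b)
```

In a trained model, `W` and `b` would be learned jointly with the fine-tuned feature extractors; here they only demonstrate the shape of the computation.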

Keywords: Digital Humanities; Multimodal Classification; Image Classification
Received: 2021-08-25      Published: 2022-02-18
CLC number (ZTFLH): G202
Funding: National Natural Science Foundation of China General Program (72074108); Fundamental Research Funds for the Central Universities (010814370113)
Corresponding author: Wang Hao, ORCID: 0000-0002-0131-0823, E-mail: ywhaowang@nju.edu.cn
Cite this article:
Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong. Classifying Images of Intangible Cultural Heritages with Multimodal Fusion. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 329-337.
Article links:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0911 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I2/3/329
Fig.1  Examples of ICH images and their textual descriptions
Fig.2  The ICH image classification model based on multimodal fusion
Fig.3  Structure of FICM
Table 1  Distribution of images and textual descriptions across New Year Print categories
Model          Precision (%)   Recall (%)   F1 (%)
block4_conv1   69.987          67.485       67.687
block4_conv2   72.675          71.432       71.684
block4_conv3   73.066          71.794       72.028
block4_conv4   68.549          68.149       68.119
fc             65.480          63.445       63.092
Table 2  Results of fine-tuning each convolutional layer of block4 and the fully connected layer (fc) in FICM
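The layer-wise fine-tuning compared in Table 2 amounts to freezing every pre-trained layer and unfreezing only the one under study. The toy sketch below illustrates that bookkeeping; the layer names follow Table 2, but the registry itself is an invented stand-in for a framework's trainable-flag mechanism (e.g. per-parameter gradient switches):

```python
# Toy registry of VGG-style layers; a real model would carry weights here.
layers = ["block4_conv1", "block4_conv2", "block4_conv3", "block4_conv4", "fc"]
params = {name: {"trainable": False} for name in layers}

def fine_tune_only(params, layer_name):
    """Freeze all layers, then mark only `layer_name` as trainable."""
    for p in params.values():
        p["trainable"] = False
    params[layer_name]["trainable"] = True
    return sorted(n for n, p in params.items() if p["trainable"])

# block4_conv3 gave the best F1 in Table 2, so tune only that layer.
trainable = fine_tune_only(params, "block4_conv3")
```

During training, only the unfrozen layer's weights would then receive gradient updates, adapting the pre-trained visual features to the New Year Print domain.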
Table 3  Visualization results of each convolutional layer (block5_conv1 to block5_conv4) in block5 of FICM, compared with the original image
Model       Precision (%)   Recall (%)   F1 (%)
VGG19       69.696          68.814       68.721
CNN         65.399          62.624       62.997
SVM (V)     61.293          60.106       60.116
BERT        72.599          71.766       71.568
SVM (V+T)   75.748          73.690       73.885
ICMMF       78.813          77.113       77.574
Table 4  Comparison of ICMMF with the baseline models
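The Precision, Recall, and F1 percentages in Tables 2 and 4 are presumably macro-averaged over the four print classes (the source does not state the averaging mode, so this is an assumption). A self-contained sketch of macro-averaged metrics, with invented toy labels:

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the label set."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)   # predicted as class c
        true_c = sum(1 for t in y_true if t == c)   # actually class c
        prec = tp / pred_c if pred_c else 0.0
        rec = tp / true_c if true_c else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy two-class example (invented data, not from the paper).
precision, recall, f1 = macro_scores(["a", "a", "b", "b"],
                                     ["a", "b", "b", "b"])
```

Macro averaging weights each class equally, which matters here because Table 1 suggests the four print categories may differ in sample count.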
Fig.4  Classification results for each New Year Print category using different modalities
Dropout   F1 (%)
0.9       71.856
0.7       76.784
0.5       77.574
0.3       72.982
0.1       74.775
Table 5  Effect of the dropout value on ICMMF performance
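Table 5 shows a moderate dropout rate (0.5) performing best for ICMMF. Inverted dropout, the variant most deep-learning frameworks implement, can be sketched as below; this is a generic illustration of the mechanism, not the paper's code:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p at train
    time and rescale survivors by 1/(1-p), so inference is the identity."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(42)
x = np.ones(1000)
train_out = dropout(x, 0.5, training=True, rng=rng)   # roughly half zeroed, rest doubled
eval_out = dropout(x, 0.5, training=False, rng=rng)   # unchanged
```

The rescaling keeps the expected activation magnitude equal between training and inference, which is why no adjustment is needed when the trained model is deployed.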