Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 329-337    DOI: 10.11925/infotech.2096-3467.2021.0911
Classifying Images of Intangible Cultural Heritages with Multimodal Fusion
Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a new method combining images and textual descriptions, aiming to improve the classification of Intangible Cultural Heritage (ICH) images. [Methods] We built a new model with multimodal fusion, which includes a fine-tuned deep pre-trained model for extracting visual semantic features, a BERT model for extracting textual features, a fusion layer concatenating the visual and textual features, and an output layer predicting labels. [Results] We examined the proposed model on the national ICH project of New Year Prints, classifying Mianzhu, Taohuawu, Yangjiabu, and Yangliuqing prints. We found that fine-tuning the convolutional layers strengthened the visual semantic features of the ICH images, with the F1 value for classification reaching 72.028%. Compared with the baseline models, our method yielded the best results, with an F1 value of 77.574%. [Limitations] The proposed model was only tested on New Year Prints and needs to be extended to more ICH projects in the future. [Conclusions] Adding textual description features improves the performance of ICH image classification, and fine-tuning the convolutional layers of a deep pre-trained image model improves the extraction of visual semantic features.
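The fusion strategy described in the abstract, concatenating the visual and textual feature vectors before a shared output layer, can be sketched as follows. This is a minimal NumPy sketch with untrained random weights: the feature dimensions (512 for the visual branch, 768 for BERT) are illustrative assumptions, and only the four New Year Prints classes come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions: 512-d visual (CNN) and 768-d textual (BERT).
visual_feat = rng.standard_normal(512)
textual_feat = rng.standard_normal(768)

# Fusion layer: simple concatenation of the two modality vectors.
fused = np.concatenate([visual_feat, textual_feat])  # shape (1280,)

# Output layer: a linear map to the four New Year Prints classes, then softmax.
classes = ["Mianzhu", "Taohuawu", "Yangjiabu", "Yangliuqing"]
W = rng.standard_normal((len(classes), fused.shape[0])) * 0.01
b = np.zeros(len(classes))

logits = W @ fused + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # valid probability distribution over the four classes

predicted = classes[int(np.argmax(probs))]
```

In the actual model both branches would be trained end to end; the sketch only shows how concatenation joins the two modalities into one classification input.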

Key words: Digital Humanities; Multimodal Classification; Image Classification
Received: 25 August 2021      Published: 18 February 2022
ZTFLH:  G202  
Fund:National Natural Science Foundation of China(72074108);Fundamental Research Funds for the Central Universities(010814370113)
Corresponding Author: Wang Hao, ORCID: 0000-0002-0131-0823, E-mail: ywhaowang@nju.edu.cn

Cite this article:

Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong. Classifying Images of Intangible Cultural Heritages with Multimodal Fusion. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 329-337.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0911     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/329

An Example of ICH Image and Its Textual Description
ICH Image Classification Model Based on Multimodal Fusion
The Structure of FICM
Distribution of New Year Prints Images and Textual Descriptions by Type
Model Precision (%) Recall (%) F1 (%)
block4_conv1 69.987 67.485 67.687
block4_conv2 72.675 71.432 71.684
block4_conv3 73.066 71.794 72.028
block4_conv4 68.549 68.149 68.119
fc 65.480 63.445 63.092
Fine-tuning Results of Convolutional Layers and fc Layers of block4 in FICM
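The fine-tuning experiment above varies which layer of VGG19's block4 is the first unfrozen layer. A minimal sketch of that freezing logic (layer names follow the standard VGG19 naming and only a slice of the network is listed; the training loop itself is omitted):

```python
# Ordered (simplified) slice of VGG19 layer names.
vgg19_layers = [
    "block3_conv4", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_conv4",
    "block4_pool", "block5_conv1", "fc1", "fc2",
]

def trainable_flags(layers, unfreeze_from):
    """Freeze every layer before `unfreeze_from`; train it and all later layers."""
    idx = layers.index(unfreeze_from)
    return {name: (i >= idx) for i, name in enumerate(layers)}

# Best configuration in the table: fine-tune from block4_conv3 onward.
flags = trainable_flags(vgg19_layers, "block4_conv3")
# flags["block4_conv2"] -> False (frozen), flags["block4_conv3"] -> True (trained)
```

In Keras this corresponds to setting each layer's `trainable` attribute before compiling the model.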
Visualization Results of Convolutional Layers in block5 of FICM (panels: original image, block5_conv1, block5_conv2, block5_conv3, block5_conv4)
Model Precision (%) Recall (%) F1 (%)
VGG19 69.696 68.814 68.721
CNN 65.399 62.624 62.997
SVM (V) 61.293 60.106 60.116
BERT 72.599 71.766 71.568
SVM (V+T) 75.748 73.690 73.885
ICMMF 78.813 77.113 77.574
Results Between ICMMF and Other Baseline Models
Classification Results of Different New Year Prints with Different Modalities
Dropout rate F1 (%)
0.9 71.856
0.7 76.784
0.5 77.574
0.3 72.982
0.1 74.775
The Impact of dropout Value on the Performance of ICMMF Model