Review of Multimodal Named Entity Recognition Studies
Han Pu1,2, Chen Wenqi1
1 School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2 Provincial Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China
[Objective] This paper reviews multimodal named entity recognition research to provide references for future studies. [Coverage] We selected 83 representative papers retrieved from the Web of Science, IEEE Xplore, ACM Digital Library, and CNKI databases, using “multimodal named entity recognition”, “multimodal information extraction”, and “multimodal knowledge graph” as search terms. [Methods] We summarized multimodal named entity recognition research in four aspects: concepts, feature representation, fusion strategies, and pre-trained models. We also identified existing problems and future research directions. [Results] Multimodal named entity recognition studies focus on modal feature representation and fusion, and they have made some progress in the field of social media. However, they still need improved multimodal fine-grained feature extraction and semantic association mapping methods to enhance model generalization and interpretability. [Limitations] There is insufficient literature that directly takes multimodal named entity recognition as its research topic. [Conclusions] Our study provides new ideas for expanding the applications of multimodal learning, breaking modal barriers, and bridging semantic gaps.