数据分析与知识发现  2024, Vol. 8 Issue (4): 50-63     https://doi.org/10.11925/infotech.2096-3467.2023.0488
综述评介
多模态命名实体识别研究进展*
韩普1,2(),陈文祺1
1南京邮电大学管理学院 南京 210003
2数据工程与知识服务省高校重点实验室(南京大学) 南京 210023
Review of Multimodal Named Entity Recognition Studies
Han Pu1,2(),Chen Wenqi1
1School of Management, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
2Provincial Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China
摘要 

【目的】 梳理归纳多模态命名实体识别研究成果,为后续相关研究提供参考与借鉴。【文献范围】 在Web of Science、IEEE Xplore、ACM Digital Library、中国知网数据库中,以“多模态命名实体识别”“多模态信息抽取”“多模态知识图谱”为检索词进行文献检索,共筛选出83篇代表性文献。【方法】 从概念、特征表示、融合策略和预训练模型4个方面对多模态命名实体识别研究进行总结论述,指出现存问题和未来研究方向。【结果】 多模态命名实体识别目前主要围绕模态特征表示和融合两个方面展开且在社交媒体领域取得了一定进展,需要进一步改进多模态细粒度特征提取和语义关联映射方法以提升模型的泛化性和可解释性。【局限】 直接以多模态命名实体识别为研究主题的文献数量较少,在支撑综述结果方面存在局限性。【结论】 针对多模态命名实体识别亟需解决的问题展望未来发展趋势,为进一步拓宽多模态学习在下游任务应用的研究范畴、破解模态壁垒和语义鸿沟提供了新思路。

关键词: 多模态命名实体识别; 特征表示; 多模态融合; 多模态预训练
Abstract

[Objective] This paper reviews multimodal named entity recognition research to provide references for future studies. [Coverage] We selected 83 representative papers retrieved from the Web of Science, IEEE Xplore, ACM Digital Library, and CNKI databases, using “multimodal named entity recognition”, “multimodal information extraction”, and “multimodal knowledge graph” as search terms. [Methods] We summarized multimodal named entity recognition research in four aspects: concepts, feature representation, fusion strategies, and pre-trained models. We also identified existing problems and future research directions. [Results] Multimodal named entity recognition research focuses on modal feature representation and fusion, and has made some progress in the social media domain. Future work needs to improve multimodal fine-grained feature extraction and semantic association mapping to enhance the models’ generalization and interpretability. [Limitations] There is insufficient literature that directly takes multimodal named entity recognition as its research topic, which limits the evidence supporting this review. [Conclusions] Our study provides new ideas for expanding the applications of multimodal learning, breaking modal barriers, and bridging semantic gaps.

Key words: Multimodal Named Entity Recognition; Feature Representation; Multimodal Fusion; Multimodal Pre-training
收稿日期: 2023-05-23      出版日期: 2024-03-01
中图分类号: TP391; G35
基金资助:* 国家社会科学基金项目(22BTQ096);江苏高校青蓝工程和江苏省研究生科研创新计划基金项目(KYCX23_0930)
通讯作者: 韩普,ORCID:0000-0001-5867-4292,E-mail: hanpu@njupt.edu.cn。   
引用本文:   
韩普, 陈文祺. 多模态命名实体识别研究进展*[J]. 数据分析与知识发现, 2024, 8(4): 50-63.
Han Pu, Chen Wenqi. Review of Multimodal Named Entity Recognition Studies. Data Analysis and Knowledge Discovery, 2024, 8(4): 50-63.
链接本文:  
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0488      或      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I4/50
Fig.1  多模态命名实体识别研究框架(以文本-图像为例)
Fig.2  两种多模态融合架构示意图
Fig.3  多模态预训练架构(以图像-文本为例)
架构种类 | 原理 | 优点 | 缺点 | 代表性模型
单流架构 | 将文本和视觉特征组合在一起,馈入单个Transformer块,通过合并注意力融合多模态输入 | 参数效率更高 | 无法解耦,需成对送入编码 | VisualBERT[73]、VL-BERT[74]、UNITER[75]
双流架构 | 将文本和视觉特征独立输入不同的编码块,不共享参数,通常使用交叉注意力实现跨模态交互 | 各模态的网络深度不同,独立编码,自由组合;可快速解耦 | 参数量大 | ViLBERT[76]、LXMERT[77]、CLIP[78]
Table 1  单流架构和双流架构的对比
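表1中单流与双流架构的区别,可以用一个极简的示意代码说明(纯属演示性草图:假设 2 维词元特征和朴素的缩放点积注意力,并非任何具体模型的实现):单流架构将两种模态拼接后做联合自注意力,双流架构则各模态独立编码、通过交叉注意力交互。

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """缩放点积注意力的玩具实现(列表代替张量)。"""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))])
    return out

# 假设的 2 维词元特征(仅作演示)。
text  = [[1.0, 0.0], [0.5, 0.5]]
image = [[0.0, 1.0]]

# 单流架构: 拼接两种模态, 在联合序列上做自注意力(合并注意力)。
joint = text + image
single_stream = attention(joint, joint, joint)

# 双流架构: 文本作为 query, 图像作为 key/value 的交叉注意力(模态独立编码)。
dual_stream = attention(text, image, image)

print(len(single_stream))  # 联合序列共 3 个融合后的词元
print(len(dual_stream))    # 2 个文本词元, 各自融入了视觉上下文
```

可以看到两种架构的差异只体现在注意力的输入组织方式上:单流共享一套参数处理拼接序列(参数效率高但难以解耦),双流的两支编码器可自由组合与解耦(代价是参数量更大)。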
模型 | 模型输入 | 组成模块 | 预训练任务
RIVA[18] | 乘法拼接图像和文本模态 | (1)图文关系门控网络;(2)注意力引导视觉上下文网络;(3)视觉语言上下文网络 | (1)图片文本关系分类;(2)掩码区域预测
RpBERT[19] | 加法拼接图像和文本模态 | (1)图文关系分类模块;(2)视觉语言学习模块 | (1)图片文本关系分类;(2)关系传播机制
Table 2  多模态命名实体识别中多模态预训练模型
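表2中的"图片文本关系分类"任务可用来生成一个门控信号:当图像与文本无关时压低视觉特征的权重。下面是该门控思想的极简草图(假设标量关系 logit 和 2 维特征,仅为示意性简化,并非 RIVA 或 RpBERT 的真实架构):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, image_vec, relation_logit):
    """用图文关系门控缩放视觉特征后与文本特征相加融合。
    relation_logit 为假设的关系分类得分(越大表示图文越相关)。"""
    g = sigmoid(relation_logit)  # 门控值, 近似 P(图像与文本相关)
    return [t + g * v for t, v in zip(text_vec, image_vec)]

text_vec  = [0.2, 0.8]
image_vec = [1.0, 1.0]

related   = gated_fusion(text_vec, image_vec, 4.0)   # 门控值约 0.98, 视觉特征几乎完整保留
unrelated = gated_fusion(text_vec, image_vec, -4.0)  # 门控值约 0.02, 融合结果近似退化为纯文本
```

这种设计使无关图像带来的视觉噪声在融合前就被抑制,与表2中两个模型利用图文关系信号指导融合的思路一致。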
[1] Moon S, Neves L, Carvalho V. Multimodal Named Entity Recognition for Short Social Media Posts[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018: 852-860.
[2] Zhang Q, Fu J L, Liu X Y, et al. Adaptive Co-attention Network for Named Entity Recognition in Tweets[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 5674-5681.
[3] 吴友政, 李浩然, 姚霆, 等. 多模态信息处理前沿综述:应用、融合和预训练[J]. 中文信息学报, 2022, 36(5): 1-20.
[3] (Wu Youzheng, Li Haoran, Yao Ting, et al. A Survey of Multimodal Information Processing Frontiers: Application, Fusion and Pre-training[J]. Journal of Chinese Information Processing, 2022, 36(5): 1-20.)
[4] Yao W, Yoshinaga N. Visually-Guided Named Entity Recognition by Grounding Words with Images via Dense Retrieval[C]// Proceedings of the Association for Natural Language Processing. 2022: 1361-1365.
[5] Elliott D, Frank S, Hasler E. Multilingual Image Description with Neural Sequence Models[OL]. arXiv Preprint, arXiv:1510.04709.
[6] Antol S, Agrawal A, Lu J S, et al. VQA: Visual Question Answering[C]// Proceedings of 2015 IEEE International Conference on Computer Vision. 2015: 2425-2433.
[7] Zhu X R, Li Z X, Wang X D, et al. Multi-modal Knowledge Graph Construction and Application: A Survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2024, 36(2): 715-735.
[8] Baltrušaitis T, Ahuja C, Morency L P. Multimodal Machine Learning: A Survey and Taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443.
doi: 10.1109/TPAMI.2018.2798607 pmid: 29994351
[9] Liang P P, Zadeh A, Morency L P. Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions[OL]. arXiv Preprint, arXiv:2209.03430.
[10] 何俊, 张彩庆, 李小珍, 等. 面向深度学习的多模态融合技术研究综述[J]. 计算机工程, 2020, 46(5): 1-11.
doi: 10.19678/j.issn.1000-3428.0057370
[10] (He Jun, Zhang Caiqing, Li Xiaozhen, et al. Survey of Research on Multimodal Fusion Technology for Deep Learning[J]. Computer Engineering, 2020, 46(5): 1-11.)
doi: 10.19678/j.issn.1000-3428.0057370
[11] 王惠茹, 李秀红, 李哲, 等. 多模态预训练模型综述[J]. 计算机应用, 2023, 43(4): 991-1004.
[11] (Wang Huiru, Li Xiuhong, Li Zhe, et al. Survey of Multimodal Pre-training Models[J]. Journal of Computer Applications, 2023, 43(4): 991-1004.)
[12] Lu D, Neves L, Carvalho V, et al. Visual Attention Model for Name Tagging in Multimodal Social Media[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers). 2018: 1990-1999.
[13] Arshad O, Gallo I, Nawaz S, et al. Aiding Intra-Text Representations with Visual Context for Multimodal Named Entity Recognition[C]// Proceedings of 2019 International Conference on Document Analysis and Recognition. IEEE, 2019: 337-342.
[14] Yu J, Jiang J, Yang L, et al. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 3342-3352.
[15] Asgari-Chenaghlu M, Feizi-Derakhshi M R, Farzinvash L, et al. CWI: A Multimodal Deep Learning Approach for Named Entity Recognition from Social Media Using Character, Word and Image Features[J]. Neural Computing and Applications, 2022, 34(3): 1905-1922.
[16] Zhang D, Wei S Z, Li S S, et al. Multi-Modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021: 14347-14355.
[17] Zheng C M, Wu Z W, Wang T, et al. Object-Aware Multimodal Named Entity Recognition in Social Media Posts with Adversarial Learning[J]. IEEE Transactions on Multimedia, 2020, 23: 2520-2532.
[18] Sun L, Wang J Q, Su Y D, et al. RIVA: A Pre-trained Tweet Multimodal Model Based on Text-Image Relation for Multimodal NER[C]// Proceedings of the 28th International Conference on Computational Linguistics. 2020: 1852-1862.
[19] Sun L, Wang J Q, Zhang K, et al. RpBERT: A Text-Image Relation Propagation-Based BERT Model for Multimodal NER[C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021: 13860-13868.
[20] Liu L P, Wang M L, Zhang M Z, et al. UAMNer: Uncertainty-Aware Multimodal Named Entity Recognition in Social Media Posts[J]. Applied Intelligence, 2022, 52(4): 4109-4125.
[21] Sui D B, Tian Z K, Chen Y B, et al. A Large-Scale Chinese Multimodal NER Dataset with Speech Clues[C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1:Long Papers). 2021: 2807-2818.
[22] 范涛, 王昊, 陈玥彤. 基于深度迁移学习的地方志多模态命名实体识别研究[J]. 情报学报, 2022, 41(4): 412-423.
[22] (Fan Tao, Wang Hao, Chen Yuetong. Research on Multimodal Named Entity Recognition of Local History Based on Deep Transfer Learning[J]. Journal of the China Society for Scientific and Technical Information, 2022, 41(4): 412-423.)
[23] Xuan Z Y, Bao R, Jiang S Y. FGN: Fusion Glyph Network for Chinese Named Entity Recognition[C]// Proceedings of China Conference on Knowledge Graph and Semantic Computing. 2020: 28-40.
[24] Chen D W, Li Z X, Gu B B, et al. Multimodal Named Entity Recognition with Image Attributes and Image Knowledge[C]// Proceedings of International Conference on Database Systems for Advanced Applications. 2021: 186-201.
[25] Wang X W, Tian J F, Gui M, et al. PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition[C]// Proceedings of International Conference on Database Systems for Advanced Applications. 2022: 297-305.
[26] Xu B, Huang S Z, Sha C F, et al. MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition[C]// Proceedings of the 15th ACM International Conference on Web Search and Data Mining. 2022: 1215-1223.
[27] Liu Y, Huang S B, Li R S, et al. USAF: Multimodal Chinese Named Entity Recognition Using Synthesized Acoustic Features[J]. Information Processing & Management, 2023, 60(3): 103290.
[28] Tian Y, Sun X, Yu H F, et al. Hierarchical Self-Adaptation Network for Multimodal Named Entity Recognition in Social Media[J]. Neurocomputing, 2021, 439: 12-21.
[29] Wang X Y, Gui M, Jiang Y, et al. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition[OL]. arXiv Preprint, arXiv:2112.06482.
[30] 李晓腾, 张盼盼, 勾智楠, 等. 基于多任务学习的多模态命名实体识别方法[J]. 计算机工程, 2023, 49(4): 114-119.
doi: 10.19678/j.issn.1000-3428.0064087
[30] (Li Xiaoteng, Zhang Panpan, Gou Zhinan, et al. Multi-Modal Named Entity Recognition Method Based on Multi-Task Learning[J]. Computer Engineering, 2023, 49(4):114-119.)
doi: 10.19678/j.issn.1000-3428.0064087
[31] Huang Y, Du C Z, Xue Z H, et al. What Makes Multi-modal Learning Better than Single[C]// Proceedings of the 35th Conference on Neural Information Processing Systems. 2021.
[32] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv:1301.3781.
[33] 李代祎, 张笑文, 严丽. 一种基于异构图网络的多模态实体识别方法[J/OL]. 小型微型计算机系统: 1-10. [2023-07-24]. https://kns.cnki.net/kcms/detail/21.1106.TP.20230711.1048.002.html.
[33] (Li Daiyi, Zhang Xiaowen, Yan Li. A Multimodal Name Entity Recognition Method Based on Heterogeneous Graph Network[J]. Journal of Chinese Computer Systems: 1-10. [2023-07-24]. http://kns.cnki.net/kcms/detail/21.1106.TP.20230711.1048.002.html.)
[34] Kattenborn T, Leitloff J, Schiefer F, et al. Review on Convolutional Neural Networks (CNN) in Vegetation Remote Sensing[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2021, 173: 24-49.
[35] Huang Z H, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging[OL]. arXiv Preprint, arXiv:1508.01991.
[36] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv:1810.04805.
[37] Chen S G, Aguilar G, Neves L, et al. Can Images Help Recognize Entities? A Study of The Role of Images for Multimodal NER[OL]. arXiv Preprint, arXiv:2010.12712.
[38] He K M, Zhang X Y, Ren S Q, et al. Deep Residual Learning for Image Recognition[C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[39] Zhu T G, Wang Y, Li H R, et al. Multimodal Joint Attribute Prediction and Value Extraction for E-Commerce Product[OL]. arXiv Preprint, arXiv:2009.07162.
[40] Hu X M. Multimodal Named Entity Recognition and Relation Extraction with Retrieval-Augmented Strategy[C]// Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2023: 3488.
[41] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
doi: 10.1109/TPAMI.2016.2577031 pmid: 27295650
[42] Kiela D, Bottou L. Learning Image Embeddings Using Convolutional Neural Networks for Improved Multi-Modal Semantics[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 36-45.
[43] Anderson P, He X D, Buehler C, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering[C]// Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 6077-6086.
[44] Dong L H, Xu S, Xu B. Speech-Transformer: A No-recurrence Sequence-to-Sequence Model for Speech Recognition[C]// Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. 2018: 5884-5888.
[45] Purwins H, Li B, Virtanen T, et al. Deep Learning for Audio Signal Processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2019, 13(2): 206-219.
doi: 10.1109/JSTSP.2019.2908700
[46] 胡峰松, 张璇. 基于梅尔频率倒谱系数与翻转梅尔频率倒谱系数的说话人识别方法[J]. 计算机应用, 2012, 32(9): 2542-2544.
[46] (Hu Fengsong, Zhang Xuan. Speaker Recognition Method Based on Mel Frequency Cepstrum Coefficient and Inverted Mel Frequency Cepstrum Coefficient[J]. Journal of Computer Applications, 2012, 32(9):2542-2544.)
[47] Zhang X, Yuan J L, Li L, et al. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition[C]// Proceedings of the 16th ACM International Conference on Web Search and Data Mining. 2023: 958-966.
[48] Liu P P, Li H, Ren Y M, et al. A Novel Framework for Multimodal Named Entity Recognition with Multi-level Alignments[OL]. arXiv Preprint, arXiv: 2305.08372.
[49] Khare Y, Bagal V, Mathew M, et al. MMBERT: Multimodal BERT Pretraining for Improved Medical VQA[C]// Proceedings of 2021 IEEE 18th International Symposium on Biomedical Imaging. 2021: 1033-1036.
[50] Jiang Y G, Wu Z X, Wang J, et al. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(2): 352-364.
[51] Habibian A, Mensink T, Snoek C G M. Video2vec Embeddings Recognize Events when Examples are Scarce[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(10): 2089-2103.
doi: 10.1109/TPAMI.2016.2627563 pmid: 27849523
[52] Fukui A, Park D H, Yang D, et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding[OL]. arXiv Preprint, arXiv:1606.01847.
[53] Lu J S, Yang J W, Batra D, et al. Hierarchical Question-Image Co-Attention for Visual Question Answering[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016: 289-297.
[54] Zadeh A, Liang P P, Poria S, et al. Multi-attention Recurrent Network for Human Communication Comprehension[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 5642-5649.
[55] Pang L, Ngo C W. Mutlimodal Learning with Deep Boltzmann Machine for Emotion Prediction in User Generated Videos[C]// Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. 2015: 619-622.
[56] Martínez H P, Yannakakis G N. Deep Multimodal Fusion: Combining Discrete Events and Continuous Signals[C]// Proceedings of the 16th International Conference on Multimodal Interaction. 2014: 34-41.
[57] Rasiwasia N, Pereira J C, Coviello E, et al. A New Approach to Cross-Modal Multimedia Retrieval[C]// Proceedings of the 18th ACM International Conference on Multimedia. 2010: 251-260.
[58] Wang B K, Yang Y, Xu X, et al. Adversarial Cross-Modal Retrieval[C]// Proceedings of the 25th ACM International Conference on Multimedia. 2017: 154-162.
[59] Wang X W, Ye J B, Li Z X, et al. CAT-MNER: Multimodal Named Entity Recognition with Knowledge-Refined Cross-Modal Attention[C]// Proceedings of 2022 IEEE International Conference on Multimedia and Expo. 2022: 1-6.
[60] Yin Y J, Meng F D, Su J S, et al. A Novel Graph-Based Multi-modal Fusion Encoder for Neural Machine Translation[OL]. arXiv Preprint, arXiv:2007.08742.
[61] Poria S, Chaturvedi I, Cambria E, et al. Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis[C]// Proceedings of 2016 IEEE 16th International Conference on Data Mining. 2016: 439-448.
[62] Zadeh A, Zellers R, Pincus E, et al. Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
[63] Zhou B H, Zhang Y, Song K H, et al. A Span-Based Multimodal Variational Autoencoder for Semi-supervised Multimodal Named Entity Recognition[C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022: 6293-6302.
[64] Nojavanasghari B, Gopinath D, Koushik J, et al. Deep Multimodal Fusion for Persuasiveness Prediction[C]// Proceedings of the 18th ACM International Conference on Multimodal Interaction. 2016: 284-288.
[65] Kampman O, Barezi E J, Bertero D, et al. Investigating Audio, Visual, and Text Fusion Methods for End-to-End Automatic Personality Prediction[OL]. arXiv Preprint, arXiv:1805.00705.
[66] Wu Z W, Zheng C M, Cai Y, et al. Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts[C]// Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1038-1046.
[67] Zhao X Y, Tang B Z. Multimodal Named Entity Recognition via Co-attention-Based Method with Dynamic Visual Concept Expansion[C]// Proceedings of International Conference on Neural Information Processing. 2021: 476-487.
[68] Kim J H, Jun J, Zhang B T. Bilinear Attention Networks[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018: 1571-1581.
[69] Zhao G, Dong G T, Shi Y D, et al. Entity-Level Interaction via Heterogeneous Graph for Multimodal Named Entity Recognition[C]// Findings of the Association for Computational Linguistics:EMNLP 2022. 2022: 6345-6350.
[70] 尹奇跃, 黄岩, 张俊格, 等. 基于深度学习的跨模态检索综述[J]. 中国图象图形学报, 2021, 26(6): 1368-1388.
[70] (Yin Qiyue, Huang Yan, Zhang Junge, et al. Survey on Deep Learning Based Cross-Modal Retrieval[J]. Journal of Image and Graphics, 2021, 26(6): 1368-1388.)
[71] 李志义, 黄子风, 许晓绵. 基于表示学习的跨模态检索模型与特征抽取研究综述[J]. 情报学报, 2018, 37(4): 422-435.
[71] (Li Zhiyi, Huang Zifeng, Xu Xiaomian. A Review of the Cross-Modal Retrieval Model and Feature Extraction Based on Representation Learning[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(4): 422-435.)
[72] 唐樾, 马静. 基于增强对抗网络和多模态融合的谣言检测方法[J]. 情报科学, 2022, 40(6): 108-114.
[72] (Tang Yue, Ma Jing. A Rumor Detection Method Based on Enhance Adversarial Network and Multimodal Fusion[J]. Information Science, 2022, 40(6): 108-114.)
[73] Li L H, Yatskar M, Yin D, et al. VisualBERT: A Simple and Performant Baseline for Vision and Language[OL]. arXiv Preprint, arXiv: 1908.03557.
[74] Su W J, Zhu X Z, Cao Y, et al. VL-BERT: Pre-training of Generic Visual-Linguistic Representations[OL]. arXiv Preprint, arXiv: 1908.08530.
[75] Chen Y C, Li L J, Yu L C, et al. UNITER: UNiversal Image-TExt Representation Learning[C]// Proceedings of European Conference on Computer Vision. 2020: 104-120.
[76] Lu J S, Batra D, Parikh D, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks[C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019: 13-23.
[77] Tan H, Bansal M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers[OL]. arXiv Preprint, arXiv:1908.07490.
[78] Radford A, Kim J W, Hallacy C, et al. Learning Transferable Visual Models from Natural Language Supervision[C]// Proceedings of the 38th International Conference on Machine Learning. 2021: 8748-8763.
[79] Li G, Duan N, Fang Y J, et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-training[C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020: 11336-11344.
[80] Chen X L, Fang H, Lin T Y, et al. Microsoft COCO Captions: Data Collection and Evaluation Server[OL]. arXiv Preprint, arXiv: 1504.00325.
[81] Dou Z Y, Xu Y C, Gan Z, et al. An Empirical Study of Training End-to-End Vision-and-Language Transformers[C]// Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 18145-18155.
[82] Yuan L, Cai Y, Wang J, et al. Joint Multimodal Entity-Relation Extraction Based on Edge-Enhanced Graph Alignment Network and Word-Pair Relation Tagging[OL]. arXiv Preprint, arXiv: 2211.15028.
[83] Wang P, Chen X H, Shang Z Y, et al. Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning[J]. IEICE Transactions on Information and Systems, 2023, 106(4): 545-555.