Vision-and-language pre-training aims to learn joint visual and linguistic representations that can be transferred to visual-linguistic downstream tasks. However, semantic confusion between language and vision can arise during the pre-training stage. Moreover, current pre-trained models typically require substantial computational resources for fine-tuning when transferred to downstream tasks. In this work, we present a simple but effective approach for Adaptive Fine-tuning of Vision and Language pre-trained models, namely AFVL. Specifically, we introduce a pair-wise contrastive loss that learns alignments between whole sentences and the images in the same batch during pre-training. At the fine-tuning stage, we introduce two lightweight adaptation networks that reduce the number of trainable parameters and increase training speed, saving computational resources. We evaluate AFVL on four main downstream tasks: Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR), Natural Language for Visual Reasoning (NLVR), and Region-to-Phrase Grounding (RPG). Compared to previous methods, AFVL achieves comparable or better results while reducing fine-tuning time and GPU memory by a large margin. Extensive experiments and ablation studies demonstrate the effectiveness of the contrastive pre-training and the efficiency of the adaptive fine-tuning proposed in AFVL.
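As a rough illustration of the two ingredients described above, the following PyTorch-style sketch (not the authors' released code; module names, dimensions, and the temperature value are illustrative assumptions) shows an in-batch pair-wise contrastive loss over matched image-sentence pairs, and a lightweight bottleneck adapter of the kind that can be trained on top of a frozen backbone at fine-tuning time.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pairwise_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # Normalize so the dot product is a cosine similarity.
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        # (B, B) similarity matrix: diagonal entries are the matched
        # image-sentence pairs; off-diagonal entries are in-batch negatives.
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE over image-to-text and text-to-image directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    class BottleneckAdapter(nn.Module):
        # Small residual bottleneck inserted into a frozen backbone;
        # only these parameters are updated during fine-tuning.
        def __init__(self, hidden_dim=768, bottleneck_dim=64):
            super().__init__()
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.up = nn.Linear(bottleneck_dim, hidden_dim)

        def forward(self, x):
            return x + self.up(F.relu(self.down(x)))

Freezing the backbone and updating only such adapter parameters is the general mechanism by which adapter-style fine-tuning cuts trainable parameters, GPU memory, and training time per downstream task.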
Author Information
Shentong Mo (Carnegie Mellon University)
Jingfei Xia (Carnegie Mellon University)
Ihor Markevych (Carnegie Mellon University)
More from the Same Authors
- 2021: Simulated Annealing for Neural Architecture Search
  Shentong Mo · Jingfei Xia · Pinxu Ren
- 2021: Multi-modal Self-supervised Pre-training for Large-scale Genome Data
  Shentong Mo · Xi Fu · Chenyang Hong · Yizhen Chen · Yuxuan Zheng · Xiangru Tang · Yanyan Lan · Zhiqiang Shen · Eric Xing
- 2022 Poster: Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing
  Shentong Mo · Yapeng Tian
- 2022 Poster: A Closer Look at Weakly-Supervised Audio-Visual Source Localization
  Shentong Mo · Pedro Morgado
- 2021 Workshop: 2nd Workshop on Self-Supervised Learning: Theory and Practice
  Pengtao Xie · Ishan Misra · Pulkit Agrawal · Abdelrahman Mohamed · Shentong Mo · Youwei Liang · Jeannette Bohg · Kristina N Toutanova
- 2021: Poster Session 2 (gather.town)
  Wenjie Li · Akhilesh Soni · Jinwuk Seok · Jianhao Ma · Jeffery Kline · Mathieu Tuli · Miaolan Xie · Robert Gower · Quanqi Hu · Matteo Cacciola · Yuanlu Bai · Boyue Li · Wenhao Zhan · Shentong Mo · Junhyung Lyle Kim · Sajad Fathi Hafshejani · Chris Junchi Li · Zhishuai Guo · Harshvardhan Harshvardhan · Neha Wadia · Tatjana Chavdarova · Difan Zou · Zixiang Chen · Aman Gupta · Jacques Chen · Betty Shea · Benoit Dherin · Aleksandr Beznosikov