Spotlight
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Junnan Li · Ramprasaath Selvaraju · Akhilesh Gotmare · Shafiq Joty · Caiming Xiong · Steven Chu Hong Hoi
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR$^2$, ALBEF achieves absolute improvements of 2.37% and 3.84% over the state of the art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
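The two ideas named in the abstract, a contrastive loss that aligns image and text embeddings and pseudo-targets produced by a momentum model, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the temperature of 0.07, the mixing weight `alpha`, the EMA decay of 0.995, and the choice to blend one-hot targets with the momentum model's softmax scores are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarity scores."""
    scaled = [x / temperature for x in row]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text in the batch."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in texts] for i in images]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def soft_cross_entropy(sim, targets, temperature):
    """Mean cross-entropy between target distributions and softmax(sim)."""
    total = 0.0
    for row, target in zip(sim, targets):
        probs = softmax(row, temperature)
        total -= sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)
    return total / len(sim)


def contrastive_loss(image_embs, text_embs, momentum_sim=None,
                     temperature=0.07, alpha=0.4):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    Pair i of image_embs matches pair i of text_embs. If momentum_sim (the
    similarity matrix from a hypothetical momentum model) is given, the
    one-hot targets are blended with its softmax scores (assumed scheme).
    """
    sim = similarity_matrix(image_embs, text_embs)
    n = len(sim)
    one_hot = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def targets_for(msim):
        if msim is None:
            return one_hot
        return [
            [(1 - alpha) * h + alpha * q
             for h, q in zip(hot_row, softmax(m_row, temperature))]
            for hot_row, m_row in zip(one_hot, msim)
        ]

    # Image-to-text uses rows of the matrix; text-to-image uses its columns.
    i2t = soft_cross_entropy(sim, targets_for(momentum_sim), temperature)
    m_t = transpose(momentum_sim) if momentum_sim is not None else None
    t2i = soft_cross_entropy(transpose(sim), targets_for(m_t), temperature)
    return (i2t + t2i) / 2


def ema_update(momentum_params, params, decay=0.995):
    """Exponential moving average used to keep a slowly changing momentum model."""
    return [decay * m + (1 - decay) * p for m, p in zip(momentum_params, params)]
```

In this sketch, a batch whose images and texts are correctly paired yields a lower loss than the same batch with the texts shuffled, and passing a `momentum_sim` matrix softens the targets so that plausible but unlabeled matches are penalized less than under hard one-hot labels.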
Author Information
Junnan Li (National University of Singapore)
Ramprasaath Selvaraju (Virginia Tech)
Akhilesh Gotmare (Salesforce Research)
I am a Machine Learning Researcher with Salesforce Research Asia in Singapore. I finished my MSc at the Department of Computer Science at EPFL, Switzerland, where I worked in Prof. Martin Jaggi's Machine Learning and Optimization Laboratory on my thesis project. During my Master's, I was an intern with Salesforce Research in Palo Alto (Apr - Sept 2018).
Shafiq Joty (Nanyang Technological University)
Caiming Xiong (State University of New York at Buffalo)
Steven Chu Hong Hoi (Salesforce)
Related Events (a corresponding poster, oral, or spotlight)
2021 Poster: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Wed. Dec 8th 08:30 -- 10:00 AM
More from the Same Authors
2021 Spotlight: Understanding the Under-Coverage Bias in Uncertainty Estimation
Yu Bai · Song Mei · Huan Wang · Caiming Xiong
2021 Spotlight: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
2022 Poster: CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
Hung Le · Yue Wang · Akhilesh Deepak Gotmare · Silvio Savarese · Steven Chu Hong Hoi
2021 Poster: Sample-Efficient Learning of Stackelberg Equilibria in General-Sum Games
Yu Bai · Chi Jin · Huan Wang · Caiming Xiong
2021 Poster: Evaluating State-of-the-Art Classification Models Against Bayes Optimality
Ryan Theisen · Huan Wang · Lav Varshney · Caiming Xiong · Richard Socher
2021 Poster: Understanding the Under-Coverage Bias in Uncertainty Estimation
Yu Bai · Song Mei · Huan Wang · Caiming Xiong
2021 Poster: A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning
Pan Zhou · Caiming Xiong · Xiaotong Yuan · Steven Chu Hong Hoi
2021 Poster: Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning
Tengyang Xie · Nan Jiang · Huan Wang · Caiming Xiong · Yu Bai
2020 Poster: Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning
Pan Zhou · Jiashi Feng · Chao Ma · Caiming Xiong · Steven Chu Hong Hoi · Weinan E
2020 Poster: Theory-Inspired Path-Regularized Differential Network Architecture Search
Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
2020 Oral: Theory-Inspired Path-Regularized Differential Network Architecture Search
Pan Zhou · Caiming Xiong · Richard Socher · Steven Chu Hong Hoi
2020 Poster: Data Diversification: A Simple Strategy For Neural Machine Translation
Xuan-Phi Nguyen · Shafiq Joty · Kui Wu · Ai Ti Aw
2020 Poster: Self-Supervised Relationship Probing
Jiuxiang Gu · Jason Kuen · Shafiq Joty · Jianfei Cai · Vlad I. Morariu · Handong Zhao · Tong Sun
2018: Poster Session 1 (note there are numerous missing names here, all papers appear in all poster sessions)
Akhilesh Gotmare · Kenneth Holstein · Jan Brabec · Michal Uricar · Kaleigh Clary · Cynthia Rudin · Sam Witty · Andrew Ross · Shayne O'Brien · Babak Esmaeili · Jessica Forde · Massimo Caccia · Ali Emami · Scott Jordan · Bronwyn Woods · D. Sculley · Rebekah Overdorf · Nicolas Le Roux · Peter Henderson · Brandon Yang · Tzu-Yu Liu · David Jensen · Niccolo Dalmasso · Weitang Liu · Paul Marc TRICHELAIR · Jun Ki Lee · Akanksha Atrey · Matt Groh · Yotam Hechtlinger · Emma Tosch
2018 Poster: Unsupervised Learning of View-invariant Action Representations
Junnan Li · Yongkang Wong · Qi Zhao · Mohan Kankanhalli