We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce the Multiway Transformer, in which each block contains a pool of modality-specific experts and a shared self-attention layer. Thanks to the modeling flexibility of the Multiway Transformer, a pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pretraining strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2, and image-text retrieval.
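The core routing idea of a Multiway Transformer block — one self-attention layer shared across modalities, followed by a feed-forward expert selected per token's modality — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the attention stub, the toy expert functions, and all names (`multiway_block`, `vision_expert`, `language_expert`) are hypothetical simplifications, whereas real blocks use multi-head attention, LayerNorm, and learned FFN experts.

```python
def shared_self_attention(tokens):
    """Stand-in for the shared self-attention layer: mixes each token
    with the mean of all tokens (not real attention; illustration only)."""
    mean = sum(tokens) / len(tokens)
    return [0.5 * t + 0.5 * mean for t in tokens]

def vision_expert(x):
    """Toy vision feed-forward expert (illustrative: scale by 2)."""
    return 2 * x

def language_expert(x):
    """Toy language feed-forward expert (illustrative: shift by 1)."""
    return x + 1

def multiway_block(tokens, modalities):
    """One Multiway block: shared attention over all tokens, then each
    token is routed to the expert matching its modality tag."""
    experts = {"vision": vision_expert, "language": language_expert}
    attended = shared_self_attention(tokens)
    return [experts[m](x) for x, m in zip(attended, modalities)]

# Mixed image-text input: both tokens attend jointly, then diverge by expert.
out = multiway_block([1.0, 3.0], ["vision", "language"])
print(out)  # → [3.0, 3.5]
```

Because the attention layer is shared while only the experts differ, the same block can serve fusion-encoder fine-tuning (mixed modality tags) or dual-encoder retrieval (all tokens of one modality), which is the flexibility the abstract refers to.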
Author Information
Hangbo Bao (Harbin Institute of Technology)
Wenhui Wang (Microsoft Research)
Li Dong (Microsoft Research)
Qiang Liu (Chinese Academy of Sciences)
Owais Khan Mohammed (Indian Institute of Technology, Bombay)
Kriti Aggarwal (Microsoft)
Subhojit Som (Microsoft)
Songhao Piao (Harbin Institute of Technology)
Furu Wei (Microsoft Research Asia)
More from the Same Authors
- 2022 Poster: On the Representation Collapse of Sparse Mixture of Experts
  Zewen Chi · Li Dong · Shaohan Huang · Damai Dai · Shuming Ma · Barun Patra · Saksham Singhal · Payal Bajaj · Xia Song · Xian-Ling Mao · Heyan Huang · Furu Wei
- 2022 Poster: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models
  Dongkuan (DK) Xu · Subhabrata Mukherjee · Xiaodong Liu · Debadeepta Dey · Wenhui Wang · Xiang Zhang · Ahmed Awadallah · Jianfeng Gao
- 2020 Poster: BERT Loses Patience: Fast and Robust Inference with Early Exit
  Wangchunshu Zhou · Canwen Xu · Tao Ge · Julian McAuley · Ke Xu · Furu Wei
- 2020 Poster: Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point
  Bita Darvish Rouhani · Daniel Lo · Ritchie Zhao · Ming Liu · Jeremy Fowers · Kalin Ovtcharov · Anna Vinogradsky · Sarah Massengill · Lita Yang · Ray Bittner · Alessandro Forin · Haishan Zhu · Taesik Na · Prerak Patel · Shuai Che · Lok Chand Koppaka · Xia Song · Subhojit Som · Kaustav Das · Saurabh K T · Steve Reinhardt · Sitaram Lanka · Eric Chung · Doug Burger
- 2020 Poster: MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers
  Wenhui Wang · Furu Wei · Li Dong · Hangbo Bao · Nan Yang · Ming Zhou
- 2019 Poster: Unified Language Model Pre-training for Natural Language Understanding and Generation
  Li Dong · Nan Yang · Wenhui Wang · Furu Wei · Xiaodong Liu · Yu Wang · Jianfeng Gao · Ming Zhou · Hsiao-Wuen Hon