VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Hangbo Bao · Wenhui Wang · Li Dong · Qiang Liu · Owais Khan Mohammed · Kriti Aggarwal · Subhojit Som · Songhao Piao · Furu Wei

Thu Dec 01 09:00 AM -- 11:00 AM (PST) @ Hall J #635

We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network. Specifically, we introduce the Multiway Transformer, in which each block contains a pool of modality-specific experts and a shared self-attention layer. Because of this modeling flexibility, the pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks, or used as a dual encoder for efficient image-text retrieval. Moreover, we propose a stagewise pre-training strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs. Experimental results show that VLMo achieves state-of-the-art results on various vision-language tasks, including VQA, NLVR2, and image-text retrieval.
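The block structure described in the abstract, a shared self-attention layer followed by a pool of modality-specific experts, can be illustrated with a small sketch. This is not the authors' implementation. The class name `MultiwayBlock`, the expert names (`"vision"`, `"language"`, `"vision_language"`), the single-head attention, the ReLU feed-forward experts, and the omission of layer normalization are all assumptions made to keep the example short and dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


class MultiwayBlock:
    """Hypothetical sketch: shared self-attention plus one FFN expert per modality."""

    def __init__(self, dim, modalities=("vision", "language", "vision_language"), seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Attention projections are shared by every modality.
        self.wq = rand_matrix(rng, dim, dim)
        self.wk = rand_matrix(rng, dim, dim)
        self.wv = rand_matrix(rng, dim, dim)
        # Each modality gets its own feed-forward expert (assumed two-layer FFN).
        self.experts = {
            name: (rand_matrix(rng, dim, 2 * dim), rand_matrix(rng, 2 * dim, dim))
            for name in modalities
        }

    def shared_attention(self, x):
        """Single-head scaled dot-product self-attention over the token rows of x."""
        q, k, v = matmul(x, self.wq), matmul(x, self.wk), matmul(x, self.wv)
        k_t = [list(col) for col in zip(*k)]
        scale = math.sqrt(self.dim)
        weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
        return matmul(weights, v)

    def expert_ffn(self, x, modality):
        """Apply the feed-forward expert selected by the modality name."""
        w1, w2 = self.experts[modality]
        hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
        return matmul(hidden, w2)

    def forward(self, x, modality):
        """Shared attention, then the modality's expert, each with a residual."""
        h = add(x, self.shared_attention(x))
        return add(h, self.expert_ffn(h, modality))


if __name__ == "__main__":
    block = MultiwayBlock(dim=4)
    tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
    # The same attention weights serve both calls; only the expert differs.
    image_out = block.forward(tokens, "vision")
    text_out = block.forward(tokens, "language")
    print(len(image_out), len(image_out[0]))
```

In this sketch, routing is decided by the caller rather than learned: passing `"vision"` or `"language"` encodes each modality separately, which is the dual-encoder usage, while a caller that concatenates image and text tokens and selects a joint expert would correspond to the fusion-encoder usage.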

Author Information

Hangbo Bao (Harbin Institute of Technology)
Wenhui Wang (Microsoft Research)
Li Dong (Microsoft Research)
Qiang Liu (Chinese Academy of Sciences)
Owais Khan Mohammed (Indian Institute of Technology, Bombay)
Kriti Aggarwal (Microsoft)
Subhojit Som (Microsoft)
Songhao Piao (Harbin Institute of Technology)
Furu Wei (Microsoft Research Asia)
