Timezone: »

Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Zhe Gan · Yen-Chun Chen · Linjie Li · Chen Zhu · Yu Cheng · Jingjing Liu

Tue Dec 08 07:30 PM -- 07:40 PM (PST) @ Orals & Spotlights: Vision Applications

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the ``free'' adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

Author Information

Zhe Gan (Microsoft)
Yen-Chun Chen (Microsoft)
Linjie Li (Microsoft)
Chen Zhu (University of Maryland, College Park)
Yu Cheng (Microsoft)
Jingjing Liu (Microsoft)

Related Events (a corresponding poster, oral, or spotlight)

More from the Same Authors