Timezone: »
We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the ``free'' adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
Author Information
Zhe Gan (Microsoft)
Yen-Chun Chen (Microsoft)
Linjie Li (Microsoft)
Chen Zhu (University of Maryland, College Park)
Yu Cheng (Microsoft)
Jingjing Liu (Microsoft)
Related Events (a corresponding poster, oral, or spotlight)
-
2020 Poster: Large-Scale Adversarial Training for Vision-and-Language Representation Learning »
Wed. Dec 9th 05:00 -- 07:00 AM Room Poster Session 2 #676
More from the Same Authors
-
2021 : VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation »
Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu -
2021 : Diurnal or Nocturnal? Federated Learning from Periodically Shifting Distributions »
Chen Zhu · Zheng Xu · Mingqing Chen · Jakub Konečný · Andrew S Hard · Tom Goldstein -
2022 : Interactive Industrial Panel »
Jiahao Sun · Ahmed Ibrahim · Marjan Ghazvininejad · Yu Cheng · Boxing Chen · Mohammad Norouzi · Rahul Gupta -
2022 Poster: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone »
Zi-Yi Dou · Aishwarya Kamath · Zhe Gan · Pengchuan Zhang · Jianfeng Wang · Linjie Li · Zicheng Liu · Ce Liu · Yann LeCun · Nanyun Peng · Jianfeng Gao · Lijuan Wang -
2022 Poster: Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors »
Ravid Shwartz-Ziv · Micah Goldblum · Hossein Souri · Sanyam Kapoor · Chen Zhu · Yann LeCun · Andrew Wilson -
2022 Poster: M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design »
hanxue liang · Zhiwen Fan · Rishov Sarkar · Ziyu Jiang · Tianlong Chen · Kai Zou · Yu Cheng · Cong Hao · Zhangyang Wang -
2022 Poster: GLIPv2: Unifying Localization and Vision-Language Understanding »
Haotian Zhang · Pengchuan Zhang · Xiaowei Hu · Yen-Chun Chen · Liunian Li · Xiyang Dai · Lijuan Wang · Lu Yuan · Jenq-Neng Hwang · Jianfeng Gao -
2021 Poster: VQ-GNN: A Universal Framework to Scale up Graph Neural Networks using Vector Quantization »
Mucong Ding · Kezhi Kong · Jingling Li · Chen Zhu · John Dickerson · Furong Huang · Tom Goldstein -
2021 Poster: GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training »
Chen Zhu · Renkun Ni · Zheng Xu · Kezhi Kong · W. Ronny Huang · Tom Goldstein -
2021 Poster: Long-Short Transformer: Efficient Transformers for Language and Vision »
Chen Zhu · Wei Ping · Chaowei Xiao · Mohammad Shoeybi · Tom Goldstein · Anima Anandkumar · Bryan Catanzaro -
2020 : The Intrinsic Dimension of Images and Its Impact on Learning »
Chen Zhu · Micah Goldblum · Ahmed Abdelkader · Tom Goldstein · Phillip Pope -
2019 : Break / Poster Session 1 »
Antonia Marcu · Yao-Yuan Yang · Pascale Gourdeau · Chen Zhu · Thodoris Lykouris · Jianfeng Chi · Mark Kozdoba · Arjun Nitin Bhagoji · Xiaoxia Wu · Jay Nandy · Michael T Smith · Bingyang Wen · Yuege Xie · Konstantinos Pitas · Suprosanna Shit · Maksym Andriushchenko · Dingli Yu · Gaël Letarte · Misha Khodak · Hussein Mozannar · Chara Podimata · James Foulds · Yizhen Wang · Huishuai Zhang · Ondrej Kuzelka · Alexander Levine · Nan Lu · Zakaria Mhammedi · Paul Viallard · Diana Cai · Lovedeep Gondara · James Lucas · Yasaman Mahdaviyeh · Aristide Baratin · Rishi Bommasani · Alessandro Barp · Andrew Ilyas · Kaiwen Wu · Jens Behrmann · Omar Rivasplata · Amir Nazemi · Aditi Raghunathan · Will Stephenson · Sahil Singla · Akhil Gupta · YooJung Choi · Yannic Kilcher · Clare Lyle · Edoardo Manino · Andrew Bennett · Zhi Xu · Niladri Chatterji · Emre Barut · Flavien Prost · Rodrigo Toro Icarte · Arno Blaas · Chulhee Yun · Sahin Lale · YiDing Jiang · Tharun Kumar Reddy Medini · Ashkan Rezaei · Alexander Meinke · Stephen Mell · Gary Kazantsev · Shivam Garg · Aradhana Sinha · Vishnu Lokhande · Geovani Rizk · Han Zhao · Aditya Kumar Akash · Jikai Hou · Ali Ghodsi · Matthias Hein · Tyler Sypherd · Yichen Yang · Anastasia Pentina · Pierre Gillot · Antoine Ledent · Guy Gur-Ari · Noah MacAulay · Tianzong Zhang -
2019 Poster: Improving Textual Network Learning with Variational Homophilic Embeddings »
Wenlin Wang · Chenyang Tao · Zhe Gan · Guoyin Wang · Liqun Chen · Xinyuan Zhang · Ruiyi Zhang · Qian Yang · Ricardo Henao · Lawrence Carin -
2018 Poster: Adversarial Text Generation via Feature-Mover's Distance »
Liqun Chen · Shuyang Dai · Chenyang Tao · Haichao Zhang · Zhe Gan · Dinghan Shen · Yizhe Zhang · Guoyin Wang · Dinghan Shen · Lawrence Carin -
2018 Poster: Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization »
Yizhe Zhang · Michel Galley · Jianfeng Gao · Zhe Gan · Xiujun Li · Chris Brockett · Bill Dolan