Timezone: »
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
Author Information
Jiasen Lu (Georgia Tech)
Dhruv Batra (Georgia Tech / Facebook AI Research (FAIR))
Devi Parikh (Georgia Tech / Facebook AI Research (FAIR))
Stefan Lee (Oregon State University)
More from the Same Authors
-
2021 Spotlight: Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · John Turner · Noah Maestre · Mustafa Mukadam · Devendra Singh Chaplot · Oleksandr Maksymets · Aaron Gokaslan · Vladimír Vondruš · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI »
Santhosh Kumar Ramakrishnan · Aaron Gokaslan · Erik Wijmans · Oleksandr Maksymets · Alexander Clegg · John Turner · Eric Undersander · Wojciech Galuba · Andrew Westbury · Angel Chang · Manolis Savva · Yili Zhao · Dhruv Batra -
2022 : Fifteen-minute Competition Overview Video »
Dhruv Batra · Manolis Savva · Zsolt Kira · Vincent-Pierre Berges · Karmesh Yadav · Angel Chang · Andrew Szot · Alexander Clegg · Aaron Gokaslan -
2023 Competition: The HomeRobot Open Vocabulary Mobile Manipulation Challenge »
Sriram Yenamandra · Arun Ramachandran · Mukul Khanna · Karmesh Yadav · Devendra Singh Chaplot · Gunjan Chhablani · Alexander Clegg · Theophile Gervet · Vidhi Jain · Ruslan Partsey · Ram Ramrakhya · Andrew Szot · Austin Wang · Tsung-Yen Yang · Aaron Edsinger · Charles Kemp · Binit Shah · Zsolt Kira · Dhruv Batra · Roozbeh Mottaghi · Yonatan Bisk · Chris Paxton -
2022 Competition: Habitat Rearrangement Challenge »
Andrew Szot · Karmesh Yadav · Alexander Clegg · Vincent-Pierre Berges · Aaron Gokaslan · Angel Chang · Manolis Savva · Zsolt Kira · Dhruv Batra -
2022 Poster: VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement »
Erik Wijmans · Irfan Essa · Dhruv Batra -
2022 Poster: SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning »
Changan Chen · Carl Schissler · Sanchit Garg · Philip Kobernik · Alexander Clegg · Paul Calamia · Dhruv Batra · Philip Robinson · Kristen Grauman -
2022 Poster: ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings »
Arjun Majumdar · Gunjan Aggarwal · Bhavika Devnani · Judy Hoffman · Dhruv Batra -
2021 : Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · Noah Maestre · Mustafa Mukadam · Oleksandr Maksymets · Aaron Gokaslan · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · Noah Maestre · Mustafa Mukadam · Oleksandr Maksymets · Aaron Gokaslan · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : AI for Augmenting Human Creativity »
Devi Parikh -
2021 : Career and Life: Panel Discussion - Bo Li, Adriana Romero-Soriano, Devi Parikh, and Emily Denton »
Remi Denton · Devi Parikh · Bo Li · Adriana Romero -
2021 Poster: SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation »
Abhinav Moudgil · Arjun Majumdar · Harsh Agrawal · Stefan Lee · Dhruv Batra -
2021 Poster: Habitat 2.0: Training Home Assistants to Rearrange their Habitat »
Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · John Turner · Noah Maestre · Mustafa Mukadam · Devendra Singh Chaplot · Oleksandr Maksymets · Aaron Gokaslan · Vladimír Vondruš · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra -
2021 : Open Catalyst Challenge + Q&A »
Abhishek Das · Muhammed Shuaibi · Siddharth Goyal · Adeesh Kolluru · Janice Lan · Aini Palizhati · Anuroop Sriram · Brandon Wood · Aditya Grover · Devi Parikh · Zachary Ulissi · Larry Zitnick -
2021 Poster: Human-Adversarial Visual Question Answering »
Sasha Sheng · Amanpreet Singh · Vedanuj Goswami · Jose Magana · Tristan Thrush · Wojciech Galuba · Devi Parikh · Douwe Kiela -
2020 Poster: Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data »
Michael Cogswell · Jiasen Lu · Rishabh Jain · Stefan Lee · Devi Parikh · Dhruv Batra -
2020 : Discussion Panel: Hugo Larochelle, Finale Doshi-Velez, Devi Parikh, Marc Deisenroth, Julien Mairal, Katja Hofmann, Phillip Isola, and Michael Bowling »
Hugo Larochelle · Finale Doshi-Velez · Marc Deisenroth · Devi Parikh · Julien Mairal · Katja Hofmann · Phillip Isola · Michael Bowling -
2019 : Panel Discussion »
Jacob Andreas · Edward Gibson · Stefan Lee · Noga Zaslavsky · Jason Eisner · Jürgen Schmidhuber -
2019 : Invited Talk - 5 »
Stefan Lee -
2019 : Opening Remarks »
Florian Strub · Harm de Vries · Abhishek Das · Stefan Lee · Erik Wijmans · Dor Arad Hudson · Alane Suhr -
2019 Workshop: Visually Grounded Interaction and Language »
Florian Strub · Abhishek Das · Erik Wijmans · Harm de Vries · Stefan Lee · Alane Suhr · Dor Arad Hudson -
2019 Poster: Cross-channel Communication Networks »
Jianwei Yang · Zhile Ren · Chuang Gan · Hongyuan Zhu · Devi Parikh -
2019 Poster: RUBi: Reducing Unimodal Biases for Visual Question Answering »
Remi Cadene · Corentin Dancette · Hedi Ben younes · Matthieu Cord · Devi Parikh -
2019 Poster: Chasing Ghosts: Instruction Following as Bayesian State Tracking »
Peter Anderson · Ayush Shrivastava · Devi Parikh · Dhruv Batra · Stefan Lee -
2018 Workshop: Visually grounded interaction and language »
Florian Strub · Harm de Vries · Erik Wijmans · Samyak Datta · Ethan Perez · Mateusz Malinowski · Stefan Lee · Peter Anderson · Aaron Courville · Jeremie MARY · Dhruv Batra · Devi Parikh · Olivier Pietquin · Chiori HORI · Tim Marks · Anoop Cherian -
2018 Poster: Overcoming Language Priors in Visual Question Answering with Adversarial Regularization »
Sainandan Ramakrishnan · Aishwarya Agrawal · Stefan Lee -
2017 : Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Mode »
Jiasen Lu -
2017 : Morning panel discussion »
Jürgen Schmidhuber · Noah Goodman · Anca Dragan · Pushmeet Kohli · Dhruv Batra -
2017 : Invited Talk 2 »
Dhruv Batra -
2017 : Panel Discussion »
Felix Hill · Olivier Pietquin · Jack Gallant · Raymond Mooney · Sanja Fidler · Chen Yu · Devi Parikh -
2017 : Towards Embodied Question Answering »
Devi Parikh -
2017 Workshop: Visually grounded interaction and language »
Florian Strub · Harm de Vries · Abhishek Das · Satwik Kottur · Stefan Lee · Mateusz Malinowski · Olivier Pietquin · Devi Parikh · Dhruv Batra · Aaron Courville · Jeremie Mary -
2017 Poster: Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model »
Jiasen Lu · Anitha Kannan · Jianwei Yang · Devi Parikh · Dhruv Batra -
2016 Poster: Hierarchical Question-Image Co-Attention for Visual Question Answering »
Jiasen Lu · Jianwei Yang · Dhruv Batra · Devi Parikh -
2016 Poster: Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles »
Stefan Lee · Senthil Purushwalkam · Michael Cogswell · Viresh Ranjan · David Crandall · Dhruv Batra -
2015 : Visual Question Answering »
Dhruv Batra -
2015 Poster: SubmodBoxes: Near-Optimal Search for a Set of Diverse Object Proposals »
Qing Sun · Dhruv Batra -
2014 Workshop: Discrete Optimization in Machine Learning »
Jeffrey A Bilmes · Andreas Krause · Stefanie Jegelka · S Thomas McCormick · Sebastian Nowozin · Yaron Singer · Dhruv Batra · Volkan Cevher -
2014 Poster: Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets »
Adarsh Prasad · Stefanie Jegelka · Dhruv Batra -
2014 Spotlight: Submodular meets Structured: Finding Diverse Subsets in Exponentially-Large Structured Item Sets »
Adarsh Prasad · Stefanie Jegelka · Dhruv Batra -
2012 Poster: Multiple Choice Learning: Learning to Produce Multiple Structured Outputs »
Abner Guzmán-Rivera · Dhruv Batra · Pushmeet Kohli -
2011 Workshop: Beyond Mahalanobis: Supervised Large-Scale Learning of Similarity »
Greg Shakhnarovich · Dhruv Batra · Brian Kulis · Kilian Q Weinberger -
2011 Poster: Understanding the Intrinsic Memorability of Images »
Phillip Isola · Devi Parikh · Antonio Torralba · Aude Oliva