Timezone: »
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech -- in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to temporally corresponding words, but also to contextualize what is happening globally over time. As a result, MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned. It also transfers well to the world of static images, allowing models to reason about the dynamic context behind visual scenes. On Visual Commonsense Reasoning, MERLOT~answers questions correctly with 80.6\% accuracy, outperforming state-of-the-art models of similar size by over 3\%, even those that make heavy use of auxiliary supervised data (like object bounding boxes).Ablation analyses demonstrate the complementary importance of: 1) training on videos versus static images; 2) scaling the magnitude and diversity of the pretraining video corpus; and 3) using diverse objectives that encourage full-stack multimodal reasoning, from the recognition to cognition level.
Author Information
Rowan Zellers (University of Washington)
Ximing Lu (Department of Computer Science, University of Washington)
Jack Hessel (Cornell University)
Youngjae Yu (Seoul National University)
Jae Sung Park (University of Washington)
Jize Cao (Department of Computer Science, University of Washington)
Ali Farhadi (University of Washington, Allen Institute for Artificial Intelligence)
Yejin Choi (University of Washington)
Related Events (a corresponding poster, oral, or spotlight)
-
2021 Oral: MERLOT: Multimodal Neural Script Knowledge Models »
Sat. Dec 11th 12:00 -- 12:15 AM Room
More from the Same Authors
-
2021 : CommonsenseQA 2.0: Exposing the Limits of AI through Gamification »
Alon Talmor · Ori Yoran · Ronan Le Bras · Chandra Bhagavatula · Yoav Goldberg · Yejin Choi · Jonathan Berant -
2021 : NaturalProofs: Mathematical Theorem Proving in Natural Language »
Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho -
2021 : Towards Grounded Natural Language Proof Generation »
Sean Welleck · Jiacheng Liu · Yejin Choi -
2021 : Robust fine-tuning of zero-shot models »
Mitchell Wortsman · Gabriel Ilharco · Jong Wook Kim · Mike Li · Hanna Hajishirzi · Ali Farhadi · Hongseok Namkoong · Ludwig Schmidt -
2022 : Information-Theoretic Evaluation of Free-Text Rationales with Conditional $\mathcal{V}$-Information »
Hanjie Chen · Faeze Brahman · Xiang Ren · Yangfeng Ji · Yejin Choi · Swabha Swayamdipta -
2023 Poster: Stable and low-precision training for large-scale vision-language models »
Mitchell Wortsman · Tim Dettmers · Luke Zettlemoyer · Ari Morcos · Ali Farhadi · Ludwig Schmidt -
2023 Poster: AdANNS: A Framework for Adaptive Semantic Search »
Aniket Rege · Aditya Kusupati · Sharan Ranjit S · Alan Fan · Qingqing Cao · Sham Kakade · Prateek Jain · Ali Farhadi -
2023 Poster: Localized Symbolic Knowledge Distillation for Visual Commonsense Models »
Jae Sung Park · Jack Hessel · Khyathi Chandu · Paul Pu Liang · Ximing Lu · Qiuyuan Huang · Peter West · Jianfeng Gao · Ali Farhadi · Yejin Choi -
2023 Poster: SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks »
Bill Yuchen Lin · Yicheng Fu · Karina Yang · Prithviraj (Raj) Ammanabrolu · Faeze Brahman · Shiyu Huang · Chandra Bhagavatula · Yejin Choi · Xiang Ren -
2023 Poster: Faith and Fate: Limits of Transformers on Compositionality »
Nouha Dziri · Ximing Lu · Melanie Sclar · Xiang (Lorraine) Li · Liwei Jiang · Bill Yuchen Lin · Sean Welleck · Peter West · Chandra Bhagavatula · Ronan Le Bras · Jena Hwang · Soumya Sanyal · Xiang Ren · Allyson Ettinger · Zaid Harchaoui · Yejin Choi -
2023 Poster: Neural Priming for Sample-Efficient Adaptation »
Matthew Wallingford · Vivek Ramanujan · Alex Fang · Aditya Kusupati · Roozbeh Mottaghi · Aniruddha Kembhavi · Ludwig Schmidt · Ali Farhadi -
2023 Poster: On the Connection between Pre-training Data Diversity and Fine-tuning Robustness »
Vivek Ramanujan · Thao Nguyen · Sewoong Oh · Ali Farhadi · Ludwig Schmidt -
2023 Poster: Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text »
Wanrong Zhu · Jack Hessel · Anas Awadalla · Samir Yitzhak Gadre · Jesse Dodge · Alex Fang · Youngjae Yu · Ludwig Schmidt · William Yang Wang · Yejin Choi -
2023 Poster: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Poster: Objaverse-XL: A Colossal Universe of 3D Objects »
Matt Deitke · Ruoshi Liu · Matthew Wallingford · Huong Ngo · Oscar Michel · Aditya Kusupati · Alan Fan · Christian Laforte · Vikram Voleti · Samir Yitzhak Gadre · Eli VanderBilt · Aniruddha Kembhavi · Carl Vondrick · Georgia Gkioxari · Kiana Ehsani · Ludwig Schmidt · Ali Farhadi -
2023 Poster: RealTime QA: What's the Answer Right Now? »
Jungo Kasai · Keisuke Sakaguchi · yoichi takahashi · Ronan Le Bras · Akari Asai · Xinyan Yu · Dragomir Radev · Noah Smith · Yejin Choi · Kentaro Inui -
2023 Oral: DataComp: In search of the next generation of multimodal datasets »
Samir Yitzhak Gadre · Gabriel Ilharco · Alex Fang · Jonathan Hayase · Georgios Smyrnis · Thao Nguyen · Ryan Marten · Mitchell Wortsman · Dhruba Ghosh · Jieyu Zhang · Eyal Orgad · Rahim Entezari · Giannis Daras · Sarah Pratt · Vivek Ramanujan · Yonatan Bitton · Kalyani Marathe · Stephen Mussmann · Richard Vencu · Mehdi Cherti · Ranjay Krishna · Pang Wei Koh · Olga Saukh · Alexander Ratner · Shuran Song · Hannaneh Hajishirzi · Ali Farhadi · Romain Beaumont · Sewoong Oh · Alex Dimakis · Jenia Jitsev · Yair Carmon · Vaishaal Shankar · Ludwig Schmidt -
2023 Workshop: AI meets Moral Philosophy and Moral Psychology: An Interdisciplinary Dialogue about Computational Ethics »
Sydney Levine · Liwei Jiang · Jared Moore · Zhijing Jin · Yejin Choi -
2022 Poster: Patching open-vocabulary models by interpolating weights »
Gabriel Ilharco · Mitchell Wortsman · Samir Yitzhak Gadre · Shuran Song · Hannaneh Hajishirzi · Simon Kornblith · Ali Farhadi · Ludwig Schmidt -
2022 Poster: COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics »
Lianhui Qin · Sean Welleck · Daniel Khashabi · Yejin Choi -
2022 Poster: QUARK: Controllable Text Generation with Reinforced Unlearning »
Ximing Lu · Sean Welleck · Jack Hessel · Liwei Jiang · Lianhui Qin · Peter West · Prithviraj Ammanabrolu · Yejin Choi -
2022 Poster: NaturalProver: Grounded Mathematical Proof Generation with Language Models »
Sean Welleck · Jiacheng Liu · Ximing Lu · Hannaneh Hajishirzi · Yejin Choi -
2022 Poster: Matryoshka Representation Learning »
Aditya Kusupati · Gantavya Bhatt · Aniket Rege · Matthew Wallingford · Aditya Sinha · Vivek Ramanujan · William Howard-Snyder · Kaifeng Chen · Sham Kakade · Prateek Jain · Ali Farhadi -
2021 : Panel Discussion »
Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou -
2021 : Battling with Larger Models through Grounding and Searching »
Yejin Choi -
2021 : NaturalProofs: Mathematical Theorem Proving in Natural Language »
Sean Welleck · Jiacheng Liu · Ronan Le Bras · Hanna Hajishirzi · Yejin Choi · Kyunghyun Cho -
2021 Poster: Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals »
Lang Liu · Krishna Pillutla · Sean Welleck · Sewoong Oh · Yejin Choi · Zaid Harchaoui -
2021 Poster: LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes »
Aditya Kusupati · Matthew Wallingford · Vivek Ramanujan · Raghav Somani · Jae Sung Park · Krishna Pillutla · Prateek Jain · Sham Kakade · Ali Farhadi -
2021 Poster: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui -
2021 : CommonsenseQA 2.0: Exposing the Limits of AI through Gamification »
Alon Talmor · Ori Yoran · Ronan Le Bras · Chandra Bhagavatula · Yoav Goldberg · Yejin Choi · Jonathan Berant -
2021 Oral: MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers »
Krishna Pillutla · Swabha Swayamdipta · Rowan Zellers · John Thickstun · Sean Welleck · Yejin Choi · Zaid Harchaoui -
2020 : Panel Discussion & Closing »
Yejin Choi · Alexei Efros · Chelsea Finn · Kristen Grauman · Quoc V Le · Yann LeCun · Ruslan Salakhutdinov · Eric Xing -
2020 : QA: Yejin Choi »
Yejin Choi -
2020 : Invited Talk: Yejin Choi »
Yejin Choi -
2020 : Adversarial, Socially Aware, and Commonsensical Data »
Yejin Choi -
2019 : Invited Talk (Yejin Choi) »
Yejin Choi -
2019 : Yejin Choi »
Yejin Choi -
2019 Poster: Defending Against Neural Fake News »
Rowan Zellers · Ari Holtzman · Hannah Rashkin · Yonatan Bisk · Ali Farhadi · Franziska Roesner · Yejin Choi -
2019 Poster: Discovering Neural Wirings »
Mitchell Wortsman · Ali Farhadi · Mohammad Rastegari