Timezone: »
While vision-and-language models perform well on tasks such as visual question answering, they struggle when it comes to basic human commonsense reasoning skills. In this work, we introduce WinoGAViL: an online game of vision-and-language associations (e.g., between werewolves and a full moon), used as a dynamic evaluation benchmark. Inspired by the popular card game Codenames, a spymaster gives a textual cue related to several visual candidates, and another player tries to identify them. Human players are rewarded for creating associations that are challenging for a rival AI model but still solvable by other human players. We use the game to collect 3.5K instances, finding that they are intuitive for humans (>90% Jaccard index) but challenging for state-of-the-art AI models, where the best model (ViLT) achieves a score of 52%, succeeding mostly where the cue is visually salient. Our analysis as well as the feedback we collect from players indicate that the collected associations require diverse reasoning skills, including general knowledge, common sense, abstraction, and more. We release the dataset, the code and the interactive game, allowing future data collection that can be used to develop models with better association abilities.
Author Information
Yonatan Bitton (The Hebrew University of Jerusalem)
I am a PhD candidate in The Hebrew University of Jerusalem, Israel, under the supervision of Dr. Roy Shwartz and Dr. Gabriel Stanovsky. The goal of my research is to improve vision and language generalization. Specifically, I aim to develop models with better compositionality abilities, less biased and better perform on real-world examples. See my publications for more details.
Nitzan Bitton Guetta (Ben-Gurion University)
Ron Yosef (The Hebrew University of Jerusalem)
Yuval Elovici (Ben Gurion University of the Negev, Technion)
Mohit Bansal (UNC Chapel Hill)
Gabriel Stanovsky (Hebrew University of Jerusalem)
Roy Schwartz (The Hebrew University of Jerusalem)
More from the Same Authors
-
2021 : VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation »
Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu -
2022 : LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
Yi-Lin Sung · Jaemin Cho · Mohit Bansal -
2022 Panel: Panel 4C-4: WinoGAViL: Gamified Association… & Communicating Natural Programs… »
Sam Acquaviva · Yonatan Bitton -
2022 Poster: TVLT: Textless Vision-Language Transformer »
Zineng Tang · Jaemin Cho · Yixin Nie · Mohit Bansal -
2022 Poster: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners »
Zhenhailong Wang · Manling Li · Ruochen Xu · Luowei Zhou · Jie Lei · Xudong Lin · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Derek Hoiem · Shih-Fu Chang · Mohit Bansal · Heng Ji -
2022 Poster: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning »
Yi-Lin Sung · Jaemin Cho · Mohit Bansal -
2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning »
Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel -
2022 Poster: VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives »
Zhuofan Ying · Peter Hase · Mohit Bansal -
2021 Poster: The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations »
Peter Hase · Harry Xie · Mohit Bansal -
2021 Poster: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer »
Zineng Tang · Jaemin Cho · Hao Tan · Mohit Bansal -
2021 Poster: Detecting Moments and Highlights in Videos via Natural Language Queries »
Jie Lei · Tamara L Berg · Mohit Bansal -
2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies »
Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela -
2017 Demonstration: Interactive-Length Multi-Task Video Captioning with Cooperative Feedback »
Han Guo · Ramakanth Pasunuru · Mohit Bansal