VISFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reasons Objectives
Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reasons (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VISFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model’s internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model’s in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning.
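The four objectives can be read as four loss terms computed over human-masked versions of the input. Below is a minimal PyTorch-style sketch of how such terms might combine into one training loss; the function name `visfis_loss`, the zeroing-based masking, the gradient-saliency explanation, the cosine alignment for Plausibility, and the loss weights are all illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of the four objectives from the abstract, assuming a
# classifier `model` that maps visual features (B, N, D) to answer logits
# (B, C). All specifics here are illustrative, not the paper's code.
import torch
import torch.nn.functional as F

def visfis_loss(model, x, human_mask, target, w=(1.0, 1.0, 1.0, 1.0)):
    """x: visual features, shape (B, N, D); human_mask: human importance per
    feature, shape (B, N) with 1 = important; target: answer labels (B,)."""
    w_suf, w_unc, w_inv, w_pla = w
    x = x.detach().requires_grad_(True)        # enable input gradients for FI
    keep = human_mask.unsqueeze(-1).float()    # (B, N, 1)

    logits_full = model(x)                     # full input
    logits_imp = model(x * keep)               # important features only
    logits_unimp = model(x * (1 - keep))       # important features removed

    # (1) Sufficiency: predict correctly from important features alone.
    loss_suf = F.cross_entropy(logits_imp, target)

    # (2) Uncertainty: with no important information, predictions should be
    # near-uniform; cross-entropy against the uniform distribution is
    # minimized exactly at maximum entropy.
    loss_unc = -F.log_softmax(logits_unimp, dim=-1).mean()

    # (3) Invariance: removing the unimportant features should not change
    # the prediction made on the full input.
    loss_inv = F.kl_div(F.log_softmax(logits_imp, dim=-1),
                        F.softmax(logits_full, dim=-1).detach(),
                        reduction="batchmean")

    # (4) Plausibility: a simple gradient-based model FI, aligned with the
    # human mask via cosine similarity (one of many possible choices).
    score = logits_full.gather(1, target.unsqueeze(1)).sum()
    grads = torch.autograd.grad(score, x, create_graph=True)[0]
    model_fi = grads.abs().sum(dim=-1)         # (B, N) importance per region
    loss_pla = 1 - F.cosine_similarity(model_fi, human_mask.float(), dim=-1).mean()

    task_loss = F.cross_entropy(logits_full, target)
    return task_loss + w_suf * loss_suf + w_unc * loss_unc \
                     + w_inv * loss_inv + w_pla * loss_pla
```

In this sketch, Sufficiency and Invariance reuse the same masked forward pass, and Plausibility requires second-order gradients (hence `create_graph=True`); the actual choice of explanation technique and alignment measure is a design dimension the paper studies rather than a fixed recipe.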
Author Information
Zhuofan Ying (University of North Carolina at Chapel Hill)
Peter Hase (University of North Carolina at Chapel Hill)
Mohit Bansal (University of North Carolina at Chapel Hill)
More from the Same Authors
- 2021: VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
  Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu
- 2022: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2023 Poster: Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
  Jaemin Cho · Abhay Zala · Mohit Bansal
- 2023 Poster: Resolving Interference When Merging Models
  Prateek Yadav · Derek Tam · Leshem Choshen · Colin Raffel · Mohit Bansal
- 2023 Poster: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
  Jialu Li · Mohit Bansal
- 2023 Poster: Self-Chained Image-Language Model for Video Localization and Question Answering
  Shoubin Yu · Jaemin Cho · Prateek Yadav · Mohit Bansal
- 2023 Poster: Paxion: Patching Action Knowledge in Video-Language Foundation Models
  Zhenhailong Wang · Ansel Blume · Sha Li · Genglin Liu · Jaemin Cho · Zineng Tang · Mohit Bansal · Heng Ji
- 2023 Poster: Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind
  Swarnadeep Saha · Peter Hase · Mohit Bansal
- 2023 Poster: Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
  Peter Hase · Mohit Bansal · Been Kim · Asma Ghandeharioun
- 2023 Poster: Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2023 Poster: Any-to-Any Generation via Composable Diffusion
  Zineng Tang · Ziyi Yang · Chenguang Zhu · Michael Zeng · Mohit Bansal
- 2022 Poster: TVLT: Textless Vision-Language Transformer
  Zineng Tang · Jaemin Cho · Yixin Nie · Mohit Bansal
- 2022 Poster: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
  Zhenhailong Wang · Manling Li · Ruochen Xu · Luowei Zhou · Jie Lei · Xudong Lin · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Derek Hoiem · Shih-Fu Chang · Mohit Bansal · Heng Ji
- 2022 Poster: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
  Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel
- 2022 Poster: WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
  Yonatan Bitton · Nitzan Bitton Guetta · Ron Yosef · Yuval Elovici · Mohit Bansal · Gabriel Stanovsky · Roy Schwartz
- 2021 Poster: The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
  Peter Hase · Harry Xie · Mohit Bansal
- 2021 Poster: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
  Zineng Tang · Jaemin Cho · Hao Tan · Mohit Bansal
- 2021 Poster: Detecting Moments and Highlights in Videos via Natural Language Queries
  Jie Lei · Tamara L Berg · Mohit Bansal
- 2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies
  Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela
- 2017 Demonstration: Interactive-Length Multi-Task Video Captioning with Cooperative Feedback
  Han Guo · Ramakanth Pasunuru · Mohit Bansal