Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- \eg overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings.
In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding.Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding. Our approach is a model agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.
Sainandan Ramakrishnan (Georgia Institute of Technology)
Graduate Student Researcher at GATech in the college of computing. Worked with Devi Parikh and Stefan Lee's group, now a TA for Computer Vision by James Hays.
Aishwarya Agrawal (Georgia Institute of Technology)
Stefan Lee (Georgia Institute of Technology)
More from the Same Authors
2019 : Panel Discussion »
Jacob Andreas · Edward Gibson · Stefan Lee · Noga Zaslavsky · Jason Eisner · Jürgen Schmidhuber
2019 : Invited Talk - 5 »
2019 : Opening Remarks »
Florian Strub · Harm de Vries · Abhishek Das · Stefan Lee · Erik Wijmans · Dor Arad Hudson · Alane Suhr
2019 Workshop: Visually Grounded Interaction and Language »
Florian Strub · Abhishek Das · Erik Wijmans · Harm de Vries · Stefan Lee · Alane Suhr · Dor Arad Hudson
2019 Poster: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks »
Jiasen Lu · Dhruv Batra · Devi Parikh · Stefan Lee
2019 Poster: Chasing Ghosts: Instruction Following as Bayesian State Tracking »
Peter Anderson · Ayush Shrivastava · Devi Parikh · Dhruv Batra · Stefan Lee
2018 : Poster Sessions and Lunch (Provided) »
Akira Utsumi · Alane Suhr · Ji Zhang · Ramon Sanabria · Kushal Kafle · Nicholas Chen · Seung Wook Kim · Aishwarya Agrawal · SRI HARSHA DUMPALA · Shikhar Murty · Pablo Azagra · Jean ROUAT · Alaaeldin Ali · · SUBBAREDDY OOTA · Angela Lin · Shruti Palaskar · Farley Lai · Amir Aly · Tingke Shen · Dianqi Li · Jianguo Zhang · Rita Kuznetsova · Jinwon An · Jean-Benoit Delbrouck · Tomasz Kornuta · Syed Ashar Javed · Christopher Davis · John Co-Reyes · Vasu Sharma · Sungwon Lyu · Ning Xie · Ankita Kalra · Huan Ling · Oleksandr Maksymets · Bhavana Mahendra Jain · Shun Po Chuang · Sanyam Agarwal · Jerome Abdelnour · Yufei Feng · vincent albouy · Siddharth Karamcheti · Derek Doran · Roberta Raileanu · Jonathan Heek