Timezone: »
Compositional generalization is a key challenge in grounding natural language to visual perception. While deep learning models have achieved great success in multimodal tasks like visual question answering, recent studies have shown that they fail to generalize to new inputs that are simply an unseen combination of those seen in the training distribution. In this paper, we propose to tackle this challenge by employing neural factor graphs to induce a tighter coupling between concepts in different modalities (e.g. images and text). Graph representations are inherently compositional in nature and allow us to capture entities, attributes and relations in a scalable manner. Our model first creates a multimodal graph, processes it with a graph neural network to induce a factor correspondence matrix, and then outputs a symbolic program to predict answers to questions. Empirically, our model achieves close to perfect scores on a caption truth prediction problem and state-of-the-art results on the recently introduced CLOSURE dataset, improving on the mean overall accuracy across seven compositional templates by 4.77\% over previous approaches.
Author Information
Raeid Saqur (University of Toronto, Vector Institute)
Raeid Saqur is a dual program graduate student with Department of Computer Science (DCS), and Electrical and Computer Engineering (ECE) at the University of Toronto. Concentration: Machine Learning in NLP and Vision.
Karthik Narasimhan (Princeton University)
More from the Same Authors
-
2021 Spotlight: Safe Reinforcement Learning with Natural Language Constraints »
Tsung-Yen Yang · Michael Y Hu · Yinlam Chow · Peter J. Ramadge · Karthik Narasimhan -
2021 Poster: Safe Reinforcement Learning with Natural Language Constraints »
Tsung-Yen Yang · Michael Y Hu · Yinlam Chow · Peter J. Ramadge · Karthik Narasimhan -
2021 Poster: SILG: The Multi-domain Symbolic Interactive Language Grounding Benchmark »
Victor Zhong · Austin W. Hanjie · Sida Wang · Karthik Narasimhan · Luke Zettlemoyer -
2020 : Invited talk - Bringing Back Text Understanding into Text-based Games - Karthik Narasimhan »
Karthik Narasimhan -
2020 Poster: Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation »
Zhiwei Deng · Karthik Narasimhan · Olga Russakovsky -
2019 Poster: A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation »
Runzhe Yang · Xingyuan Sun · Karthik Narasimhan -
2018 : Harnessing the synergy between natural language and interactive learning »
Karthik Narasimhan