Visual dialog is the task of answering a series of inter-dependent questions about an input image, and it often requires resolving visual references among the questions. This problem differs from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from a single image and question pair. We propose a novel attention mechanism that exploits past visual attentions to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with the tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~2 percentage points improvement) on the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.
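To make the mechanism concrete, the following PyTorch sketch implements an attention memory with recency-biased retrieval and question-conditioned fusion as described above. It is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names, the dimensions, and the exact form of the recency penalty are all illustrative choices.

```python
# Minimal sketch (assumptions throughout) of the attention-memory idea:
# store previous (attention, key) pairs, retrieve a relevant past attention
# with a recency bias, and fuse it with the tentative attention using
# weights predicted dynamically from the question representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMemory(nn.Module):
    def __init__(self, key_dim=128, recency_decay=0.1):
        super().__init__()
        self.keys = []        # one key vector per answered question
        self.attentions = []  # one spatial attention map per answered question
        self.recency_decay = recency_decay  # illustrative recency penalty
        # Dynamic parameter prediction: the question key produces the two
        # mixing weights for tentative vs. retrieved attention.
        self.dynamic_fc = nn.Linear(key_dim, 2)

    def write(self, attention, key):
        # Store the resolved attention and its key after answering a question.
        self.attentions.append(attention.detach())
        self.keys.append(key.detach())

    def read(self, query_key):
        # Soft associative retrieval: key similarity penalized by age.
        keys = torch.stack(self.keys)                  # (T, key_dim)
        sim = keys @ query_key                         # (T,)
        ages = torch.arange(len(self.keys) - 1, -1, -1, dtype=sim.dtype)
        weights = F.softmax(sim - self.recency_decay * ages, dim=0)
        atts = torch.stack(self.attentions)            # (T, n_regions)
        return weights @ atts                          # (n_regions,)

    def resolve(self, tentative_att, query_key):
        # Fuse tentative and retrieved attention with predicted weights.
        if not self.keys:                              # first turn: no history
            return tentative_att
        retrieved_att = self.read(query_key)
        w = F.softmax(self.dynamic_fc(query_key), dim=0)
        # Convex combination of two distributions, so the result still
        # sums to one over image regions.
        return w[0] * tentative_att + w[1] * retrieved_att


# Toy usage with hypothetical inputs: 49 image regions (a 7x7 grid),
# random question keys and tentative attentions.
mem = AttentionMemory(key_dim=128)
for _ in range(3):
    q_key = torch.randn(128)
    tentative = F.softmax(torch.randn(49), dim=0)
    final_att = mem.resolve(tentative, q_key)
    mem.write(final_att, q_key)
```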
Author Information
Paul Hongsuck Seo (POSTECH)
Andreas Lehrmann (Disney Research)
Bohyung Han (Seoul National University)
Leonid Sigal (University of British Columbia)
More from the Same Authors
- 2021 Poster: Learning Student-Friendly Teacher Networks for Knowledge Distillation (Dae Young Park · Moon-Hyun Cha · changwook jeong · Daesin Kim · Bohyung Han)
- 2021 Poster: Learning Debiased and Disentangled Representations for Semantic Segmentation (Sanghyeok Chu · Dongwan Kim · Bohyung Han)
- 2019 Poster: Combinatorial Inference against Label Noise (Paul Hongsuck Seo · Geeho Kim · Bohyung Han)
- 2017 Poster: Non-parametric Structured Output Networks (Andreas Lehrmann · Leonid Sigal)
- 2017 Poster: Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization (Hyeonwoo Noh · Tackgeun You · Jonghwan Mun · Bohyung Han)
- 2015 Poster: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation (Seunghoon Hong · Hyeonwoo Noh · Bohyung Han)
- 2015 Spotlight: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation (Seunghoon Hong · Hyeonwoo Noh · Bohyung Han)
- 2014 Poster: Object Localization based on Structural SVM using Privileged Information (Jan Feyereisl · Suha Kwak · Jeany Son · Bohyung Han)
- 2014 Poster: A Unified Semantic Embedding: Relating Taxonomies and Attributes (Sung Ju Hwang · Leonid Sigal)
- 2013 Poster: Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization (Nataliya Shapovalova · Michalis Raptis · Leonid Sigal · Greg Mori)
- 2011 Poster: Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines (Matthew D Zeiler · Graham Taylor · Leonid Sigal · Iain Matthews · Rob Fergus)