
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts
Raymond A. Yeh · Jinjun Xiong · Wen-Mei Hwu · Minh Do · Alex Schwing

Tue Dec 05 04:35 PM -- 04:50 PM (PST) @ Hall A

Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection of the solution from a set of bounding box proposals obtained from deep net based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, we are able to consider significantly more proposals and, due to the unified formulation, our approach does not rely on a successful first stage. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings which capture spatial-image relationships and provide interpretability. Lastly, our approach outperforms the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame datasets by 3.08 and 7.77, respectively.
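The abstract only summarizes the idea of searching over all bounding boxes rather than a proposal set. As an illustrative sketch (not the authors' actual algorithm, which relies on an efficient globally optimal search), the following hypothetical Python snippet shows why such a search is tractable in principle: given a per-pixel score map indicating how well each pixel matches a query phrase, an integral image (summed-area table) lets every axis-aligned box be scored in constant time, so even a brute-force loop over all boxes is feasible on small maps. All names here (`best_box`, the toy score map) are invented for illustration.

```python
import numpy as np

def best_box(score):
    """Score every axis-aligned box via a 2-D integral image
    (summed-area table) and return the highest-scoring one.
    Boxes are (y0, x0, y1, x1), inclusive coordinates."""
    H, W = score.shape
    # Integral image with a zero border: I[y, x] = sum of score[:y, :x].
    I = np.zeros((H + 1, W + 1))
    I[1:, 1:] = np.cumsum(np.cumsum(score, axis=0), axis=1)

    best, best_val = None, -np.inf
    for y0 in range(H):
        for y1 in range(y0, H):
            for x0 in range(W):
                for x1 in range(x0, W):
                    # Constant-time box sum from four integral-image lookups.
                    s = (I[y1 + 1, x1 + 1] - I[y0, x1 + 1]
                         - I[y1 + 1, x0] + I[y0, x0])
                    if s > best_val:
                        best_val, best = s, (y0, x0, y1, x1)
    return best, best_val

# Toy example: a high-scoring 2x2 patch on a negative background,
# so the optimal box is exactly that patch.
score = -np.ones((6, 6))
score[2:4, 2:4] = 5.0
box, val = best_box(score)
```

This brute force is O(H^2 W^2) boxes; the paper's contribution is finding the global optimum far more efficiently, but the score-map view above is the shared starting point.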

Author Information

Raymond A. Yeh (University of Illinois at Urbana–Champaign)
Jinjun Xiong (IBM Research)
Wen-Mei Hwu
Minh Do (University of Illinois)
Alex Schwing (University of Illinois at Urbana-Champaign)
