This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. We study this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient, compact state representation that significantly extends current methods for single-round image retrieval. We show that multiple rounds of natural language queries can be surprisingly effective at finding arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions provide an effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval in a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.
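To make the multi-round setting concrete, the following is a minimal sketch of progressive retrieval, not the Drill-down model itself. It assumes queries and images are already embedded in a shared vector space, keeps a single state vector as the running mean of the query embeddings (a stand-in for the compact state representation, chosen only for illustration), and re-ranks images by cosine similarity after each round. The toy embeddings, image names, and function names are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def update_state(state, query_vec, round_idx):
    """Running mean over rounds: s_t = s_{t-1} + (q_t - s_{t-1}) / t.

    Assumption: a simple average, used here only as an illustrative
    fixed-size state; the paper's encoder is not specified in this text.
    """
    return [s + (q - s) / round_idx for s, q in zip(state, query_vec)]


def rank_images(state, image_vecs):
    """Return image names sorted by decreasing similarity to the state."""
    scores = {name: cosine(state, vec) for name, vec in image_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def multi_round_retrieval(queries, image_vecs):
    """Re-rank all images after each query; return one ranking per round."""
    dim = len(queries[0])
    state = [0.0] * dim
    rankings = []
    for t, query_vec in enumerate(queries, start=1):
        state = update_state(state, query_vec, t)
        rankings.append(rank_images(state, image_vecs))
    return rankings


# Hypothetical 3-d embeddings with dimensions [dog, beach, frisbee].
IMAGES = {
    "dog_on_beach": [1.0, 1.0, 0.0],
    "dog_with_frisbee": [1.0, 0.0, 1.0],
    "empty_beach": [0.0, 1.0, 0.0],
}
# Round 1: "a dog"; round 2: "a frisbee".
QUERIES = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    for t, ranking in enumerate(multi_round_retrieval(QUERIES, IMAGES), start=1):
        print(f"round {t}: {ranking}")
```

In this toy setup the first query cannot separate the two dog images, and the second query moves the image containing both a dog and a frisbee to the top, which is the refinement behavior the interactive setting relies on.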
Fuwen Tan (University of Virginia)
I am a Ph.D. student in the Computer Science Department at the University of Virginia (U.Va.), working with Dr. Vicente Ordonez on Vision and Language. I am especially interested in learning compositional representations of images and language, and their applications to visual recognition, retrieval, and synthesis.
Paola Cascante-Bonilla (University of Virginia)
Xiaoxiao Guo (IBM Research)
Hui Wu (IBM Research)
Song Feng (IBM Research)
Vicente Ordonez (University of Virginia)
I'm a tenure-track Assistant Professor in the Department of Computer Science at the University of Virginia and a Visiting Professor at Adobe Research. Before this, I spent a year as a visiting researcher at the Allen Institute for Artificial Intelligence (AI2) in Seattle. I received my PhD in Computer Science at the University of North Carolina at Chapel Hill in 2015, advised by Prof. Tamara Berg. Previously, I obtained an MS in Computer Science at Stony Brook University (SUNY) and an engineering degree at the Escuela Superior Politécnica del Litoral in Ecuador. I'm a recipient of a Best Long Paper Award at the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), and the Best Paper Award (Marr Prize) at the 2013 International Conference on Computer Vision (ICCV). I have also recently been awarded an IBM Faculty Award and a Google Faculty Research Award.
More from the Same Authors
2018 Poster: Dialog-based Interactive Image Retrieval
Xiaoxiao Guo · Hui Wu · Yu Cheng · Steven Rennie · Gerald Tesauro · Rogerio Feris
2017 Poster: Dilated Recurrent Neural Networks
Shiyu Chang · Yang Zhang · Wei Han · Mo Yu · Xiaoxiao Guo · Wei Tan · Xiaodong Cui · Michael Witbrock · Mark Hasegawa-Johnson · Thomas Huang