Due to the foveated nature of the human visual system, people can focus their visual attention on only a small region of their visual field at a time, which usually contains a single object. Estimating this object of attention in first-person (egocentric) videos is useful for many human-centered real-world applications such as augmented reality and driver assistance systems. A straightforward solution is to pick the object whose bounding box contains the estimated gaze point, with the gaze point obtained from a traditional eye gaze estimator and object candidates generated by an off-the-shelf object detector. However, such an approach can fail because it addresses the where and the what problems separately, even though they are highly related, chicken-and-egg problems. In this paper, we propose a novel unified model that incorporates both spatial and temporal evidence to identify as well as locate the attended object in first-person videos. It introduces a novel Self Validation Module that enforces and leverages consistency between the where and the what concepts. We evaluate our model on two public datasets, demonstrating that the Self Validation Module significantly benefits both training and testing and that our model outperforms the state of the art.
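To make the contrast concrete, below is a minimal sketch of the naive two-stage baseline that the abstract critiques, not the paper's unified model. The function name, detection data layout, and confidence-based tie-breaking are illustrative assumptions.

```python
# A sketch of the decoupled gaze-then-detect baseline: estimate a gaze point,
# run an off-the-shelf detector, and pick the box the gaze point falls inside.
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def pick_attended_object(
    gaze: Tuple[float, float],
    detections: List[dict],  # each: {"box": Box, "label": str, "score": float}
) -> Optional[dict]:
    """Return the detection whose bounding box contains the gaze point."""
    gx, gy = gaze
    hits = [
        d for d in detections
        if d["box"][0] <= gx <= d["box"][2] and d["box"][1] <= gy <= d["box"][3]
    ]
    # When boxes overlap, ties are broken by detector confidence alone; gaze
    # and object identity never inform each other, which is one reason this
    # decoupled approach can fail.
    return max(hits, key=lambda d: d["score"]) if hits else None


# Usage example with hypothetical detector output:
detections = [
    {"box": (100.0, 80.0, 220.0, 200.0), "label": "cup", "score": 0.91},
    {"box": (150.0, 90.0, 400.0, 300.0), "label": "table", "score": 0.84},
]
print(pick_attended_object((160.0, 120.0), detections))  # -> the "cup" detection
```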
Author Information
Zehua Zhang (Indiana University Bloomington)
Chen Yu (Indiana University)
David Crandall (Indiana University)
More from the Same Authors
- 2021: Enhanced Zero-Resource Speech Challenge 2021: Language Modelling from Speech and Images + Q&A
  Ewan Dunbar · Alejandrina Cristia · Okko Räsänen · Bertrand Higy · Marvin Lavechin · Grzegorz Chrupała · Afra Alishahi · Chen Yu · Maureen De Seyssel · Tu Anh Nguyen · Mathieu Bernard · Nicolas Hamilakis · Emmanuel Dupoux
- 2020 Workshop: BabyMind: How Babies Learn and How Machines Can Imitate
  Byoung-Tak Zhang · Gary Marcus · Angelo Cangelosi · Pia Knoeferle · Klaus Obermayer · David Vernon · Chen Yu
- 2019 Poster: Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition
  Satoshi Tsutsui · Yanwei Fu · David Crandall
- 2018 Poster: Toddler-Inspired Visual Object Learning
  Sven Bambach · David Crandall · Linda Smith · Chen Yu
- 2017: Panel Discussion
  Felix Hill · Olivier Pietquin · Jack Gallant · Raymond Mooney · Sanja Fidler · Chen Yu · Devi Parikh
- 2017: How infants learn to speak by interacting with the visual world
  Chen Yu
- 2016 Poster: Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
  Stefan Lee · Senthil Purushwalkam · Michael Cogswell · Viresh Ranjan · David Crandall · Dhruv Batra