

Poster in Workshop: 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models

Pre-Trained Binocular ViTs for Image-Goal Navigation

Guillaume Bono · Leonid Antsfeld · Boris Chidlovskii · Philippe Weinzaepfel · Christian Wolf

Keywords: [ Navigation ] [ Visual Goal-Oriented ] [ Binocular Perception ] [ Embodied AI ] [ End-to-End ]


Abstract:

Most recent work in visual goal-oriented navigation relies on large-scale machine learning in simulated environments. The main challenge lies in learning compact map-like representations that generalize to unseen environments, and high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is given as an exemplar image (Image Goal), as the perception module needs to learn a comparison strategy, which requires solving an underlying visual correspondence problem. This has been shown to be hard to learn from reward alone or with standard auxiliary tasks. We address this problem with two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception: wide-baseline relative pose estimation and visibility prediction in complex scenes. Our first pretext task, cross-view completion, is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and localization directly. We propose a new dual encoder built on a binocular ViT model. Experiments show significant improvements in Image-Goal navigation performance.
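To make the dual-encoder idea concrete, below is a minimal sketch of a binocular (dual-stream) ViT encoder in PyTorch. The module names, dimensions, patch-embedding stem, and the choice of cross-attention from the observation stream onto the goal-image stream are illustrative assumptions, not the authors' actual architecture; positional embeddings and the pretext-task heads are omitted for brevity.

```python
import torch
import torch.nn as nn

class BinocularViTBlock(nn.Module):
    """Transformer block: self-attention over the observation tokens,
    then cross-attention onto the goal-image tokens (assumed design)."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, obs, goal):
        x = self.n1(obs)
        obs = obs + self.self_attn(x, x, x, need_weights=False)[0]
        x = self.n2(obs)
        obs = obs + self.cross_attn(x, goal, goal, need_weights=False)[0]
        return obs + self.mlp(self.n3(obs))

class BinocularEncoder(nn.Module):
    """Dual encoder: both images are patch-embedded by a shared stem,
    then the observation stream repeatedly attends to the goal stream.
    Positional embeddings are omitted here for brevity."""
    def __init__(self, dim=384, depth=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(BinocularViTBlock(dim) for _ in range(depth))

    def tokenize(self, img):
        # (B, 3, H, W) -> (B, N, D) sequence of patch tokens
        return self.patch_embed(img).flatten(2).transpose(1, 2)

    def forward(self, obs_img, goal_img):
        obs, goal = self.tokenize(obs_img), self.tokenize(goal_img)
        for blk in self.blocks:
            obs = blk(obs, goal)
        return obs.mean(dim=1)  # pooled embedding for a navigation policy

obs = torch.randn(1, 3, 224, 224)
goal = torch.randn(1, 3, 224, 224)
feat = BinocularEncoder()(obs, goal)  # -> torch.Size([1, 384])
```

In this kind of design, the cross-attention layers are where the comparison strategy between observation and goal image is learned, which is why pretext tasks such as cross-view completion, which force the network to relate two views of the same scene, can serve as an effective prior.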
