

Poster in Workshop: 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models

A$^2$Nav: Action-Aware Zero-Shot Robot Navigation Using Vision-Language Ability of Foundation Models

Peihao Chen · Xinyu Sun · Hongyan Zhi · Runhao Zeng · Thomas Li · Mingkui Tan · Chuang Gan

Keywords: [ Zero-Shot Learning ] [ Foundation Models ] [ Vision-and-Language Navigation ]


Abstract: We tackle the challenging task of zero-shot vision-and-language navigation (ZS-VLN), where an agent learns to follow complex path instructions without annotated data. We introduce A$^2$Nav, an action-aware ZS-VLN method that leverages foundation models such as GPT and CLIP. Our approach consists of an instruction parser and an action-aware navigation policy. The parser breaks complex instructions down into action-aware sub-tasks, each of which is executed by the corresponding learned action-specific navigation policy. Extensive experiments show that A$^2$Nav achieves promising ZS-VLN performance and even surpasses some supervised learning methods on the R2R-Habitat and RxR-Habitat datasets.
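The abstract describes a two-stage pipeline: a foundation-model-based parser decomposes an instruction into action-aware sub-tasks, and each sub-task is dispatched to an action-specific navigation policy. Below is a minimal Python sketch of that control flow; the `query_llm` wrapper, the `execute` policy interface, and the example action categories are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of an action-aware ZS-VLN pipeline (not the authors' code).
# Assumptions: `query_llm` wraps a GPT-style model, and each action-specific
# policy object exposes an `execute(landmark)` method that drives the agent.

from dataclasses import dataclass


@dataclass
class SubTask:
    action: str      # e.g. "go_to", "turn", "pass_by" (illustrative categories)
    landmark: str    # object or region mentioned in the instruction


PARSE_PROMPT = (
    "Decompose the navigation instruction into an ordered list of sub-tasks, "
    "one per line, each as '<action>: <landmark>'.\n"
    "Instruction: {instruction}\nSub-tasks:"
)


def parse_instruction(instruction: str, query_llm) -> list[SubTask]:
    """Use a foundation model to split a complex instruction into sub-tasks."""
    reply = query_llm(PARSE_PROMPT.format(instruction=instruction))
    subtasks = []
    for line in reply.splitlines():
        if ":" in line:
            action, landmark = line.split(":", 1)
            subtasks.append(SubTask(action.strip().lower(), landmark.strip()))
    return subtasks


def navigate(instruction: str, query_llm, policies: dict) -> None:
    """Execute each sub-task with the policy trained for its action type."""
    for task in parse_instruction(instruction, query_llm):
        policy = policies.get(task.action)
        if policy is None:
            continue  # in this sketch, unknown action types are simply skipped
        policy.execute(task.landmark)
```

In this sketch the dispatch table `policies` maps each action category to its own learned policy, which mirrors the paper's idea of executing sub-tasks with action-specific skills rather than a single generic point-goal policy.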
