
Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
Keji He · Yan Huang · Qi Wu · Jianhua Yang · Dong An · Shuanglin Sima · Liang Wang

Wed Dec 08 12:30 AM -- 02:00 AM (PST) @

In the Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from a fine-grained perspective. First, to alleviate the weak cross-modal alignment supervision provided by coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Second, to further enhance local cross-modal alignment under fine-grained supervision, we investigate focal-oriented rewards in both soft and hard forms, which focus on the critical points sampled from fine-grained Landmark-RxR. Moreover, to fully evaluate the navigation process, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R. Our dataset and code are available at https://github.com/hekj/Landmark-RxR.
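For intuition, the soft and hard focal-oriented rewards described above can be sketched as distance-based reward shaping around critical landmark points. The function names, the Gaussian form of the soft reward, and the threshold value below are illustrative assumptions, not the paper's exact formulation:

```python
import math

def soft_focal_reward(agent_pos, landmark_pos, sigma=1.0):
    """Hypothetical 'soft' focal-oriented reward: decays smoothly with the
    agent's Euclidean distance to a critical landmark point.
    The Gaussian shape and sigma are assumptions for illustration."""
    d = math.dist(agent_pos, landmark_pos)
    return math.exp(-d ** 2 / (2 * sigma ** 2))

def hard_focal_reward(agent_pos, landmark_pos, threshold=3.0):
    """Hypothetical 'hard' form: a binary reward given only when the agent
    is within a fixed radius of the critical point."""
    return 1.0 if math.dist(agent_pos, landmark_pos) < threshold else 0.0
```

Both forms concentrate the reward signal on the critical points sampled from Landmark-RxR, rather than spreading it uniformly along the trajectory.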

Author Information

Keji He (Institute of Automation, Chinese Academy of Sciences)
Qi Wu (University of Adelaide)
Jianhua Yang (Beijing University of Posts and Telecommunications)
Dong An
Shuanglin Sima
Liang Wang (NLPR, China)
