In the Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from a fine-grained perspective. First, to alleviate the weak cross-modal alignment supervision provided by coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Second, to further enhance local cross-modal alignment under fine-grained supervision, we investigate focal-oriented rewards in soft and hard forms, which focus on the critical points sampled from the fine-grained Landmark-RxR. Moreover, to fully evaluate the navigation process, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, that is, points that can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R. Our dataset and code are available at https://github.com/hekj/Landmark-RxR.
Author Information
Keji He (Institute of Automation, Chinese Academy of Sciences)
Yan Huang (CRIPAC, CASIA)
Qi Wu (University of Adelaide)
Jianhua Yang (Beijing University of Posts and Telecommunications)
Dong An
Shuanglin Sima
Liang Wang (NLPR, China)
More from the Same Authors
- 2022 Spotlight: MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching
  Yan Huang · Yuming Wang · Yunan Zeng · Liang Wang
- 2022 Poster: Learning Distinct and Representative Modes for Image Captioning
  Qi Chen · Chaorui Deng · Qi Wu
- 2022 Poster: MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching
  Yan Huang · Yuming Wang · Yunan Zeng · Liang Wang
- 2021 Poster: Debiased Visual Question Answering from Feature and Sample Perspectives
  Zhiquan Wen · Guanghui Xu · Mingkui Tan · Qingyao Wu · Qi Wu
- 2020 Poster: Unfolding the Alternating Optimization for Blind Super Resolution
  Zhengxiong Luo · Yan Huang · Shang Li · Liang Wang · Tieniu Tan
- 2020 Poster: Language and Visual Entity Relationship Graph for Agent Navigation
  Yicong Hong · Cristian Rodriguez · Yuankai Qi · Qi Wu · Stephen Gould
- 2015 Poster: Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution
  Yan Huang · Wei Wang · Liang Wang
- 2013 Poster: Relevance Topic Model for Unstructured Social Group Activity Recognition
  Fang Zhao · Yongzhen Huang · Liang Wang · Tieniu Tan