

Poster in Workshop: 6th Robot Learning Workshop: Pretraining, Fine-Tuning, and Generalization with Large Scale Models

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

William Chen · Oier Mees · Aviral Kumar · Sergey Levine

Keywords: [ Promptable Representations ] [ vision-language models ] [ Embodied Control ] [ Reinforcement Learning ] [ Minecraft ]


Abstract:

Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. We would like to endow RL agents with a similar ability to use contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of *promptable representations*: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on RL tasks in Minecraft and find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important for yielding good representations for RL. Finally, we give a simple method for evaluating prompts used by our approach without running expensive RL trials, ensuring that the chosen prompt extracts task-relevant semantic features from the VLM.
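To make the idea concrete, below is a minimal sketch (not the authors' code) of how a policy might consume promptable representations: a frozen VLM is queried with the current observation plus a task-context prompt, and a small RL-trained head maps the resulting embedding to actions. The `PromptableVLM` wrapper and its `encode` method are hypothetical stand-ins for any instruction-tuned VLM that exposes intermediate embeddings for an (image, prompt) pair.

```python
# Sketch only: assumes a hypothetical VLM wrapper that returns embeddings
# for an (image, prompt) pair; the policy head is trained with standard RL.
import torch
import torch.nn as nn


class PromptableVLM:
    """Hypothetical wrapper around a pre-trained vision-language model."""

    def encode(self, image: torch.Tensor, prompt: str) -> torch.Tensor:
        # In practice: run the VLM on the image together with a task-context
        # prompt (e.g. "Is there a spider nearby?") and return hidden states
        # from its text-generation pass as the representation.
        raise NotImplementedError


class PromptedPolicy(nn.Module):
    """Small policy head trained with RL on top of frozen VLM embeddings."""

    def __init__(self, vlm: PromptableVLM, prompt: str, embed_dim: int, n_actions: int):
        super().__init__()
        self.vlm = vlm
        self.prompt = prompt
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                # the VLM stays frozen
            z = self.vlm.encode(image, self.prompt)
        return self.head(z)                  # action logits for the RL algorithm
```

The key design choice the sketch illustrates is that task knowledge enters through the prompt rather than through fine-tuning the VLM: only the lightweight head is updated during RL, while the prompt determines which semantic features the VLM surfaces in its embedding.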
