

Poster in Workshop: Agent Learning in Open-Endedness Workshop

Vision-Language Models as a Source of Rewards

Harris Chan · Volodymyr Mnih · Feryal Behbahani · Michael Laskin · Luyu Wang · Fabio Pardo · Maxime Gazeau · Himanshu Sahni · Daniel Horgan · Kate Baumli · Yannick Schroecker · Stephen Spencer · Richie Steigerwald · John Quan · Gheorghe Comanici · Sebastian Flennerhag · Alexander Neitz · Lei Zhang · Tom Schaul · Satinder Singh · Clare Lyle · Tim Rocktäschel · Jack Parker-Holder · Kristian Holsheimer

Keywords: [ reward modeling ] [ generalist agents ] [ goal-conditioned reinforcement learning ]


Abstract:

Building generalist agents that can accomplish many goals in rich, open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models (VLMs) as sources of rewards for reinforcement learning agents. We show how rewards for the visual achievement of a variety of language goals can be derived from the CLIP family of models and used to train RL agents to achieve those goals. We showcase this approach in two distinct visual domains and present a scaling trend showing that larger VLMs yield more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.
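To make the idea concrete, the following is a minimal sketch of a CLIP-derived reward, assuming the Hugging Face `transformers` CLIP interface; the model checkpoint, the `clip_reward` function name, and the similarity-thresholding scheme are illustrative assumptions rather than the paper's exact setup.

```python
# A minimal sketch of a CLIP-based reward for visual goal achievement.
# Assumes the Hugging Face `transformers` CLIP interface; the checkpoint
# and threshold value below are illustrative, not the paper's settings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

@torch.no_grad()
def clip_reward(frame: Image.Image, goal_text: str, threshold: float = 0.3) -> float:
    """Return 1.0 if the observation appears to satisfy the language goal, else 0.0."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    # Cosine similarity between the image embedding and the goal-text embedding.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb * text_emb).sum(-1).item()
    # Threshold the similarity to obtain a sparse success reward for the RL agent.
    return 1.0 if similarity > threshold else 0.0
```

Such a reward function can be queried on each environment frame (or on episode termination) in place of a hand-engineered reward, with the language goal supplied as the conditioning text.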
