This work presents In-Context Policy Iteration (ICPI), an algorithm for performing reinforcement learning (RL) in context using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (fine-tuning or training of adapter layers). Both techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the “few-shot” quality that made in-context learning attractive to begin with. ICPI learns to perform RL tasks without expert demonstrations or gradients. Instead, it is a policy-iteration method in which the prompt content is the entire locus of learning: ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. To eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex (Chen et al., 2021b), a language model with no prior knowledge of the domains on which we evaluate it.
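The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It makes several loud assumptions: the foundation model is replaced by a hypothetical `lookup` stub that simply retrieves the most recent matching transition from a buffer, the buffer of transitions stands in for the "prompt", and the environment is an invented five-state chain. All names and constants are hypothetical. What the sketch keeps from the description is that no weights are updated: action values come from simulated rollouts conditioned on stored experience, actions are chosen greedily, and the only thing that changes over time is the stored experience.

```python
import random
from collections import deque

# Hypothetical toy task: states 0..4 on a chain, reward 1.0 on reaching state 4.
GOAL, MAX_STEPS, GAMMA = 4, 8, 0.9
ACTIONS = (-1, +1)


def env_step(state, action):
    """Deterministic chain dynamics with a wall at state 0."""
    nxt = min(max(state + action, 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def lookup(buffer, state, action=None):
    """Stand-in for the foundation model (an assumption, not an LLM call).

    Returns (action, reward, next_state, done) from the most recent stored
    transition that matches the query, or None if nothing matches.
    """
    for s, a, r, s2, d in reversed(buffer):
        if s == state and (action is None or a == action):
            return a, r, s2, d
    return None


def q_estimate(buffer, state, action, horizon=MAX_STEPS):
    """Estimate Q(state, action) by a rollout simulated from stored experience."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        hit = lookup(buffer, state, action)  # "world model" query
        if hit is None:
            break  # nothing in the buffer to extrapolate from
        _, reward, state, done = hit
        total += discount * reward
        discount *= GAMMA
        if done:
            break
        behaviour = lookup(buffer, state)  # rollout policy: imitate recent behaviour
        if behaviour is None:
            break
        action = behaviour[0]
    return total


def run(episodes=30, seed=0, buffer_size=200):
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)  # plays the role of the prompt contents
    returns = []
    for _ in range(episodes):
        state, ep_return = 0, 0.0
        for _ in range(MAX_STEPS):
            qs = [q_estimate(buffer, state, a) for a in ACTIONS]
            best = max(qs)
            # Greedy action with random tie-breaking.
            action = rng.choice([a for a, q in zip(ACTIONS, qs) if q == best])
            nxt, reward, done = env_step(state, action)
            buffer.append((state, action, reward, nxt, done))  # the only "learning"
            state, ep_return = nxt, ep_return + reward
            if done:
                break
        returns.append(ep_return)
    return returns, buffer


if __name__ == "__main__":
    episode_returns, _ = run()
    print(episode_returns)
```

In this toy setting, behaviour is random until the goal is first reached. After that, rollouts that imitate the most recent successful trajectory give the rightward action the highest estimated value, so later episodes keep succeeding. A lookup table is of course far weaker than a language model, which can generalize across states; the sketch only shows where the learning signal lives.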
Author Information
Ethan Brooks (University of Michigan)
Logan Walls (University of Michigan - Ann Arbor)
Richard L Lewis (University of Michigan)
Satinder Singh (DeepMind)
More from the Same Authors
-
2021 Spotlight: Proper Value Equivalence »
Christopher Grimm · Andre Barreto · Greg Farquhar · David Silver · Satinder Singh -
2021 Spotlight: Reward is enough for convex MDPs »
Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh -
2021 : GrASP: Gradient-Based Affordance Selection for Planning »
Vivek Veeriah · Zeyu Zheng · Richard L Lewis · Satinder Singh -
2022 : In-context Reinforcement Learning with Algorithm Distillation »
Michael Laskin · Luyu Wang · Junhyuk Oh · Emilio Parisotto · Stephen Spencer · Richie Steigerwald · DJ Strouse · Steven Hansen · Angelos Filos · Ethan Brooks · Maxime Gazeau · Himanshu Sahni · Satinder Singh · Volodymyr Mnih -
2022 : Optimistic Meta-Gradients »
Sebastian Flennerhag · Tom Zahavy · Brendan O'Donoghue · Hado van Hasselt · András György · Satinder Singh -
2023 Poster: Optimistic Meta-Gradients »
Sebastian Flennerhag · Tom Zahavy · Brendan O'Donoghue · Hado van Hasselt · András György · Satinder Singh -
2023 Poster: A Definition of Continual Reinforcement Learning »
David Abel · Andre Barreto · Benjamin Van Roy · Doina Precup · Hado van Hasselt · Satinder Singh -
2023 Poster: Large Language Models can Implement Policy Iteration »
Ethan Brooks · Logan Walls · Richard L Lewis · Satinder Singh -
2023 Poster: Discovering Representations for Transfer with Successor Features and the Deep Option Keyboard »
Wilka Carvalho · Andre Saraiva · Angelos Filos · Andrew Lampinen · Loic Matthey · Richard L Lewis · Honglak Lee · Satinder Singh · Danilo Jimenez Rezende · Daniel Zoran -
2023 Poster: Structured State Space Models for In-Context Reinforcement Learning »
Chris Lu · Yannick Schroecker · Albert Gu · Emilio Parisotto · Jakob Foerster · Satinder Singh · Feryal Behbahani -
2022 Poster: Palm up: Playing in the Latent Manifold for Unsupervised Pretraining »
Hao Liu · Tom Zahavy · Volodymyr Mnih · Satinder Singh -
2022 Poster: Approximate Value Equivalence »
Christopher Grimm · Andre Barreto · Satinder Singh -
2022 Poster: Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction »
Dilip Arumugam · Satinder Singh -
2021 : Reducing the Information Horizon of Bayes-Adaptive Markov Decision Processes via Epistemic State Abstraction »
Dilip Arumugam · Satinder Singh -
2021 : Bootstrapped Meta-Learning »
Sebastian Flennerhag · Yannick Schroecker · Tom Zahavy · Hado van Hasselt · David Silver · Satinder Singh -
2021 Poster: On the Expressivity of Markov Reward »
David Abel · Will Dabney · Anna Harutyunyan · Mark Ho · Michael Littman · Doina Precup · Satinder Singh -
2021 Poster: Reward is enough for convex MDPs »
Tom Zahavy · Brendan O'Donoghue · Guillaume Desjardins · Satinder Singh -
2021 Poster: Proper Value Equivalence »
Christopher Grimm · Andre Barreto · Greg Farquhar · David Silver · Satinder Singh -
2021 Poster: Discovery of Options via Meta-Learned Subgoals »
Vivek Veeriah · Tom Zahavy · Matteo Hessel · Zhongwen Xu · Junhyuk Oh · Iurii Kemaev · Hado van Hasselt · David Silver · Satinder Singh -
2021 Poster: Learning State Representations from Random Deep Action-conditional Predictions »
Zeyu Zheng · Vivek Veeriah · Risto Vuorio · Richard L Lewis · Satinder Singh -
2021 Oral: On the Expressivity of Markov Reward »
David Abel · Will Dabney · Anna Harutyunyan · Mark Ho · Michael Littman · Doina Precup · Satinder Singh -
2020 Poster: Discovering Reinforcement Learning Algorithms »
Junhyuk Oh · Matteo Hessel · Wojciech Czarnecki · Zhongwen Xu · Hado van Hasselt · Satinder Singh · David Silver -
2020 Poster: Meta-Gradient Reinforcement Learning with an Objective Discovered Online »
Zhongwen Xu · Hado van Hasselt · Matteo Hessel · Junhyuk Oh · Satinder Singh · David Silver -
2020 Poster: Learning to Play No-Press Diplomacy with Best Response Policy Iteration »
Thomas Anthony · Tom Eccles · Andrea Tacchetti · János Kramár · Ian Gemp · Thomas Hudson · Nicolas Porcel · Marc Lanctot · Julien Perolat · Richard Everett · Satinder Singh · Thore Graepel · Yoram Bachrach -
2020 Spotlight: Learning to Play No-Press Diplomacy with Best Response Policy Iteration »
Thomas Anthony · Tom Eccles · Andrea Tacchetti · János Kramár · Ian Gemp · Thomas Hudson · Nicolas Porcel · Marc Lanctot · Julien Perolat · Richard Everett · Satinder Singh · Thore Graepel · Yoram Bachrach -
2020 Poster: A Self-Tuning Actor-Critic Algorithm »
Tom Zahavy · Zhongwen Xu · Vivek Veeriah · Matteo Hessel · Junhyuk Oh · Hado van Hasselt · David Silver · Satinder Singh -
2020 Poster: On Efficiency in Hierarchical Reinforcement Learning »
Zheng Wen · Doina Precup · Morteza Ibrahimi · Andre Barreto · Benjamin Van Roy · Satinder Singh -
2020 Poster: The Value Equivalence Principle for Model-Based Reinforcement Learning »
Christopher Grimm · Andre Barreto · Satinder Singh · David Silver -
2020 Spotlight: On Efficiency in Hierarchical Reinforcement Learning »
Zheng Wen · Doina Precup · Morteza Ibrahimi · Andre Barreto · Benjamin Van Roy · Satinder Singh -
2019 Poster: Discovery of Useful Questions as Auxiliary Tasks »
Vivek Veeriah · Matteo Hessel · Zhongwen Xu · Janarthanan Rajendran · Richard L Lewis · Junhyuk Oh · Hado van Hasselt · David Silver · Satinder Singh -
2019 Poster: Hindsight Credit Assignment »
Anna Harutyunyan · Will Dabney · Thomas Mesnard · Mohammad Gheshlaghi Azar · Bilal Piot · Nicolas Heess · Hado van Hasselt · Gregory Wayne · Satinder Singh · Doina Precup · Remi Munos -
2019 Spotlight: Hindsight Credit Assignment »
Anna Harutyunyan · Will Dabney · Thomas Mesnard · Mohammad Gheshlaghi Azar · Bilal Piot · Nicolas Heess · Hado van Hasselt · Gregory Wayne · Satinder Singh · Doina Precup · Remi Munos -
2015 Poster: Action-Conditional Video Prediction using Deep Networks in Atari Games »
Junhyuk Oh · Xiaoxiao Guo · Honglak Lee · Richard L Lewis · Satinder Singh -
2015 Spotlight: Action-Conditional Video Prediction using Deep Networks in Atari Games »
Junhyuk Oh · Xiaoxiao Guo · Honglak Lee · Richard L Lewis · Satinder Singh -
2014 Poster: Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning »
Xiaoxiao Guo · Satinder Singh · Honglak Lee · Richard L Lewis · Xiaoshi Wang -
2013 Poster: Reward Mapping for Transfer in Long-Lived Agents »
Xiaoxiao Guo · Satinder Singh · Richard L Lewis -
2010 Poster: Reward Design via Online Gradient Ascent »
Jonathan D Sorg · Satinder Singh · Richard L Lewis