Workshop
Goal-Conditioned Reinforcement Learning
Benjamin Eysenbach · Ishan Durugkar · Jason Ma · Andi Peng · Tongzhou Wang · Amy Zhang
Room 206 - 207
Learning goal-directed behavior is one of the classical problems in AI, one that has received renewed interest in recent years and currently sits at the crossroads of many seemingly disparate research threads: self-supervised learning, representation learning, probabilistic inference, metric learning, and duality.
Our workshop focuses on these goal-conditioned RL (GCRL) algorithms and their connections to different areas of machine learning. Goal-conditioned RL is exciting not just because of these theoretical connections with different fields, but also because it promises to lift some of the practical challenges of applying RL algorithms: users can specify desired outcomes with a single observation, rather than a mathematical reward function. As such, GCRL algorithms may be applied to problems ranging from robotics to language model tuning to molecular design to instruction following.
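To make this interface concrete, the sketch below (an illustration of the general idea, not any particular paper's method at this workshop) shows a goal-conditioned policy in PyTorch: the desired outcome enters as a goal input alongside the state, with no hand-written reward function anywhere.

```python
# Minimal goal-conditioned policy sketch (illustrative only). The desired
# outcome is supplied as a goal input alongside the state; no hand-written
# reward function is needed.
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # The goal is just another observation: concatenate and map to actions.
        return self.net(torch.cat([obs, goal], dim=-1))

policy = GoalConditionedPolicy(obs_dim=10, goal_dim=10, act_dim=4)
action_logits = policy(torch.randn(1, 10), torch.randn(1, 10))  # shape (1, 4)
```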
Our workshop aims to bring together researchers studying the theory, methods, and applications of GCRL, researchers who are well positioned to answer questions such as:
1. How does goal-directed behavior in animals inform better GCRL algorithmic design?
2. How can GCRL enable more precise and customizable molecular generation?
3. Do GCRL algorithms provide an effective mechanism for causal reasoning?
4. When and how should GCRL algorithms be applied to precision medicine?
Schedule
Fri 7:00 a.m. - 8:00 a.m.
Poster Session 1 (Poster Session)
Fri 8:00 a.m. - 8:40 a.m.
Invited Talk: Jeff Clune (Talk)
Title: Open-Ended and AI-Generating Algorithms in the Era of Foundation Models
Abstract: Foundation models create exciting new opportunities in our longstanding quests to produce open-ended and AI-generating algorithms, meaning agents that can truly keep learning forever. In this talk I will share some of our recent work harnessing the power of foundation models to make progress in these areas, including taking advantage of different forms of being goal-conditioned. I will cover three of our more recent papers: (1) OMNI: Open-endedness via Models of human Notions of Interestingness, (2) Video Pre-Training (VPT), and (3) Thought Cloning: Learning to Think while Acting by Imitating Human Thinking.
Fri 8:45 a.m. - 9:25 a.m.
Invited Talk: Reuth Mirsky (Talk)
Title: Goal Recognition as RL - Fantastic Goals and Where to Find Them
Bio: Reuth Mirsky is an Assistant Professor at the Computer Science Department at Bar Ilan University and head of the Goal-optimization using Learning and Decision-making (GOLD) lab. She received her PhD in 2019 from Ben Gurion University and was a postdoc at the University of Texas until 2022. In her research, Reuth is interested in the similarities and differences between AI and natural intelligence and how these can be used to extend AI. Reuth is an active member of the AI and HRI research communities and was selected as one of the 2020 Electrical Engineering and Computer Science (EECS) Rising Stars. https://sites.google.com/site/dekelreuth/
Fri 9:30 a.m. - 10:10 a.m.
Invited Talk: Olexandr Isayev (Talk)
Title: Designing molecules with reinforcement learning agents and autonomous experimentation
Bio: Olexandr Isayev is an Associate Professor in the Department of Chemistry at Carnegie Mellon University. In 2008, Olexandr received his Ph.D. in computational chemistry. He was a Postdoctoral Research Fellow at Case Western Reserve University and a scientist at a government research lab. During 2016-2019, he was a faculty member at the UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill. Olexandr received the “Emerging Technology Award” from the American Chemical Society (ACS) and the GPU computing award from NVIDIA. The research in his lab focuses on connecting artificial intelligence (AI) with chemical sciences and experiment automation.
Fri 10:15 a.m. - 11:45 a.m.
Lunch Break (Break)
Fri 11:45 a.m. - 12:30 p.m.
Panel Discussion (Discussion Panel)
Panelists: Jeff Clune · Reuth Mirsky · Olexandr Isayev · Yonatan Bisk · Susan Murphy
Moderator: Benjamin Eysenbach
Fri 12:30 p.m. - 1:10 p.m.
Invited Talk: Yonatan Bisk (Talk)
Title: TBD
Bio: Yonatan Bisk is an assistant professor of computer science in Carnegie Mellon's Language Technologies Institute. His group works on grounded and embodied natural language processing, placing perception and interaction as central to how language is learned and understood. Previously, he received his PhD from the University of Illinois at Urbana-Champaign working on unsupervised Bayesian models of syntax, before spending time at USC's ISI (working on grounding), the University of Washington (for commonsense research), and Microsoft Research (for vision+language).
Fri 1:15 p.m. - 1:45 p.m.
Spotlight Talks (Spotlight)
Spotlight Talks:
* GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models
* Entity-Centric Reinforcement Learning for Object Manipulation from Pixels
* Is feedback all you need? Leveraging natural language feedback in goal-conditioned RL
* Causality in Goal Conditioned RL: Return to No Future?
* Automata Conditioned Reinforcement Learning with Experience Replay
* Empowering Clinicians with MeDT: A Framework for Sepsis Treatment
Fri 1:45 p.m. - 2:25 p.m.
Invited Talk: Susan Murphy (Talk)
Title: We used Reinforcement Learning; but did it work?
Bio: Susan Murphy’s research focuses on improving sequential, individualized decision making in digital health. She developed the micro-randomized trial for use in constructing digital health interventions; this trial design is in use across a broad range of health-related areas. Her lab works on online learning algorithms for developing personalized digital health interventions. Dr. Murphy is a member of the National Academy of Sciences and of the National Academy of Medicine, both of the US National Academies. In 2013 she was awarded a MacArthur Fellowship for her work on experimental designs to inform sequential decision making. She is a Fellow of the College on Problems of Drug Dependence, Past President of the Institute of Mathematical Statistics, and a former editor of the Annals of Statistics.
Fri 2:30 p.m. - 3:00 p.m.
Poster Session 2 (Poster Session)
The poster session ends at 5:30 p.m.
Backward Learning for Goal-Conditioned Policies (Poster)
Can we learn policies in reinforcement learning without rewards? Can we learn a policy just by trying to reach a goal state? We answer these questions positively by proposing a multi-step procedure that first learns a world model that goes backward in time, secondly generates goal-reaching backward trajectories, thirdly improves those sequences using shortest-path-finding algorithms, and finally trains a neural network policy by imitation learning. We evaluate our method on a deterministic maze environment where the observations are $64\times 64$ pixel bird's eye images and show that it consistently reaches several goals.
Marc Höftmann · Jan Robine · Stefan Harmeling
Numerical Goal-based Transformers for Practical Conditions (Poster)
Goal-conditioned reinforcement learning (GCRL) studies aim to apply trained agents in realistic environments. In particular, offline reinforcement learning is being studied as a way to reduce the cost of online interactions in GCRL. One such method is Decision Transformer (DT), which utilizes a numerical goal called "return-to-go" for superior performance. Since DT assumes an idealized environment, such as perfect knowledge of rewards, it is necessary to study an improved approach for real-world applications. In this work, we present various attempts and results for numerical goal-based transformers to operate under practical conditions.
Seonghyun Kim · Samyeul Noh · Ingook Jang
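For readers unfamiliar with DT's numerical goal, "return-to-go" at timestep $t$ is simply the sum of rewards from $t$ to the end of the trajectory. A minimal computation (a sketch; the variable names are ours) looks like:

```python
# Return-to-go: the numerical goal used by Decision Transformer. rtg[t] is
# the sum of rewards from timestep t to the end of the episode.
def returns_to_go(rewards: list) -> list:
    rtg, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

assert returns_to_go([1.0, 0.0, 2.0]) == [3.0, 2.0, 2.0]
```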
Goal-Conditioned Recommendations of AI Explanations (Poster)
The large-scale usage of Artificial Intelligence (AI) models has made it important to explain their outputs subject to the requirements and goals for using these models. The definition of goals in Goal-conditioned Reinforcement Learning (GCRL) aligns with the task of recommending an appropriate explanation, from among Explainable AI (XAI) models like SHAP or LIME, that is most interpretive for a specific AI model. We focus on two goals of training a random forest classifier to classify different training data in order to find appropriate explanations. The SlateQ recommendation system is used for simulation, where the underlying RecSim environment has a slate of documents with different quantity scores representing different goals.
Saptarashmi Bandyopadhyay · Vibhu Agrawal · John Dickerson
Contrastive Difference Predictive Coding (Poster)
Predicting and reasoning about the future lies at the heart of many time-series questions. For example, goal-conditioned reinforcement learning can be viewed as learning representations to predict which states are likely to be visited in the future. While prior methods have used contrastive predictive coding to model time-series data, learning representations that encode long-term dependencies usually requires large amounts of data. In this paper, we introduce a temporal difference version of contrastive predictive coding that stitches together pieces of different time-series data to decrease the amount of data required to learn to predict future events. We apply this representation learning method to derive an off-policy algorithm for goal-conditioned RL. Experiments demonstrate that, compared with prior RL methods, ours achieves higher success rates with less data and can better cope with stochastic environments.
Chongyi Zheng · Russ Salakhutdinov · Benjamin Eysenbach
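The contrastive ingredient that this line of work starts from is the InfoNCE objective over (state, future-state) pairs; the sketch below shows that standard building block only, not the paper's temporal difference variant:

```python
# Standard InfoNCE objective over (state, future-goal) embedding pairs: the
# contrastive building block this line of work builds on (a sketch, not the
# paper's temporal-difference variant).
import torch
import torch.nn.functional as F

def infonce_loss(state_emb: torch.Tensor, goal_emb: torch.Tensor) -> torch.Tensor:
    # Row i of goal_emb is a future state from the same trajectory as row i
    # of state_emb (the positive); all other rows serve as negatives.
    logits = state_emb @ goal_emb.T     # (batch, batch) similarity matrix
    labels = torch.arange(len(logits))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = infonce_loss(torch.randn(32, 64), torch.randn(32, 64))
```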
GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models (Poster)
Offline goal-conditioned RL (GCRL) offers a feasible paradigm to learn general-purpose policies from diverse and multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods have been restricted to model-free approaches, constraining their capacity to tackle limited data budgets and unseen goal generalization. In this work, we propose a novel two-stage model-based framework, Goal-conditioned Offline Planning (GOPlan), including (1) pretraining a prior policy capable of capturing multi-modal action distributions within the multi-goal dataset and (2) employing the reanalysis method with planning to generate imagined trajectories for fine-tuning policies. Specifically, the prior policy is based on an advantage-weighted conditioned generative adversarial network that exhibits distinct mode separation to overcome the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals. Through experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and generalize to OOD goals.
Mianchu Wang · Rui Yang · Xi Chen · Meng Fang
Middle-Mile Logistics Through the Lens of Goal-Conditioned Reinforcement Learning (Poster)
Middle-mile logistics describes the problem of routing parcels through a network of hubs, which are linked by a fixed set of trucks. The main challenge comes from the finite capacity of the trucks. The decision to allocate a parcel to a specific truck might block another parcel from using the same truck. It is thus necessary to solve for all parcel routes simultaneously. Exact solution methods scale poorly with the problem size and real-world instances are intractable. Instead, we turn to reinforcement learning (RL) by rephrasing the middle-mile problem as a multi-object goal-conditioned Markov decision process. The key ingredients of our proposed method for parcel routing are the extraction of small feature graphs from the environment state and the combination of graph neural networks with model-free RL. There remain several open challenges and we provide an open-source implementation of the environment to encourage stronger cooperation between the reinforcement learning and logistics communities.
Onno Eberhard · Thibaut Cuvelier · Michal Valko · Bruno De Backer
Automata Conditioned Reinforcement Learning with Experience Replay (Poster)
We explore the problem of goal-conditioned reinforcement learning (RL) where goals are represented using deterministic finite state automata (DFAs). Due to the sparse and binary nature of automata-based goals, we hypothesize that experience replay can help an RL agent learn more quickly in this setting. To enable the use of experience replay, we use an end-to-end neural policy, including a graph neural network (GNN) to summarize the DFA goal before feeding it to a policy network. Experimental results in a gridworld domain demonstrate the efficacy of the model architecture and highlight the significant role of experience replay in enhancing the learning speed and reducing the variance of RL agents for DFA tasks.
Niklas Lauffer · Beyazit Yalcinkaya · Marcell Vazquez-Chanlatte · Sanjit Seshia
Empowering Clinicians with MeDT: A Framework for Sepsis Treatment (Poster)
Offline reinforcement learning has shown promise for solving tasks in safety-critical settings, such as clinical decision support. Its application, however, has been limited by the need for interpretability and interactivity for clinicians. To address these challenges, we propose medical decision transformer (MeDT), a novel and versatile framework based on the goal-conditioned reinforcement learning (RL) paradigm for sepsis treatment recommendation. MeDT is based on the decision transformer architecture, and conditions the model on expected treatment outcomes, hindsight patient acuity scores, past dosages and the patient’s current and past medical state at every timestep. This allows it to consider the complete context of a patient’s medical history, enabling more informed decision-making. By conditioning the policy’s generation of actions on user-specified goals at every timestep, MeDT enables clinician interactivity while avoiding the problem of sparse rewards. Using data from the MIMIC-III dataset, we show that MeDT produces interventions that outperform or are competitive with existing methods while enabling a more interpretable, personalized and clinician-directed approach.
Aamer Abdul Rahman · Pranav Agarwal · Vincent Michalski · Rita Noumeir · Samira Ebrahimi Kahou
Is feedback all you need? Leveraging natural language feedback in goal-conditioned RL (Poster)
Despite numerous successes, reinforcement learning is still far from replicating the power and flexibility of behaviour learning in humans. One way to help bridge this gap may be to provide learning agents with richer, more humanlike feedback signals in the form of natural language. We adapt the decision transformer architecture to train agents on the BabyAI environment suite using two different types of generated language feedback, and compare the effect of using language feedback in place of return-to-go and goal description conditioning.
Sabrina McCallum · Max Taylor-Davies · Stefano Albrecht · Alessandro Suglia
Universal Visual Decomposer: Long-Horizon Manipulation Made Easy (Poster)
Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the overarching task into several manageable subtasks to facilitate policy learning and generalization to unseen tasks. Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks. To address these shortcomings, we propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation using pre-trained visual representations designed for robotic control. At a high level, UVD discovers subgoals by detecting phase shifts in the embedding space of the pre-trained representation. Operating purely on visual demonstrations without auxiliary information, UVD can effectively extract visual subgoals embedded in the videos, while incurring zero additional training cost on top of standard visuomotor policy training. Goal-conditioned policies learned with UVD-discovered subgoals exhibit significantly improved compositional generalization at test time to unseen tasks. Furthermore, UVD-discovered subgoals can be used to construct goal-based reward shaping that jump-starts temporally extended exploration for reinforcement learning. We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings on in-domain and out-of-domain task sequences alike, validating the clear advantage of automated visual task decomposition within the simple, compact UVD framework.
Zichen "Charles" Zhang · Yunshuang Li · Osbert Bastani · Abhishek Gupta · Dinesh Jayaraman · Jason Ma · Luca Weihs
Score-Models for Offline Goal-Conditioned Reinforcement Learning (Poster)
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions. Offline GCRL is pivotal for developing generalist agents capable of leveraging pre-existing datasets to learn diverse and reusable skills without hand-engineering reward functions. However, contemporary approaches to GCRL based on supervised learning and contrastive learning are often suboptimal in the offline setting. An alternative perspective on GCRL optimizes for occupancy matching, but necessitates learning a discriminator, which subsequently serves as a pseudo-reward for downstream RL. Inaccuracies in the learned discriminator can cascade, negatively influencing the resulting policy. We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe. The key insight is combining the occupancy matching perspective of GCRL with a convex dual formulation to derive a learning objective that can better leverage suboptimal offline data. SMORe learns scores or unnormalized densities representing the importance of taking an action at a state for reaching a particular goal. SMORe is principled and our extensive experiments on the fully offline GCRL benchmark comprised of robot manipulation and locomotion tasks, including high-dimensional observations, show that SMORe can outperform state-of-the-art baselines by a significant margin.
Harshit Sushil Sikchi · Rohan Chitnis · Ahmed Touati · Alborz Geramifard · Amy Zhang · Scott Niekum
Entity-Centric Reinforcement Learning for Object Manipulation from Pixels (Poster)
Manipulating objects is a hallmark of human intelligence, and an important task in domains such as robotics. In principle, Reinforcement Learning (RL) offers a general approach to learn object manipulation. In practice, however, domains with more than a few objects are difficult for RL agents due to the curse of dimensionality, especially when learning from raw image observations. In this work we propose a structured approach for visual RL that is suitable for representing multiple objects and their interaction, and use it to learn goal-conditioned manipulation of several objects. Key to our method is the ability to handle goals with dependencies between the objects (e.g., moving objects in a certain order). We further relate our architecture to the generalization capability of the trained agent, and demonstrate agents that learn with 3 objects but generalize to similar tasks with over 10 objects. Rollout videos are available on our website: https://sites.google.com/view/entity-centric-rl
Dan Haramati · Tal Daniel · Aviv Tamar
Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets (Poster)
Despite the recent advancements in offline reinforcement learning via supervised learning (RvS) methods and the success of the decision transformer (DT) architecture in various domains, DTs have proven to fall short in challenging benchmarks. The root cause of this underperformance lies in their inability to seamlessly connect segments of suboptimal trajectories, i.e., stitch, leading to poor performance. To overcome these limitations, we present a novel approach to enhance RvS methods by integrating intermediate targets. We introduce the waypoint transformer (WT), using an architecture that builds upon the DT framework and is further conditioned on dynamically-generated waypoints. The results show a significant improvement in the final return compared to existing RvS methods, with performance on par or greater than existing temporal difference learning-based methods. Additionally, the performance and stability are significantly improved in the most challenging environments and data configurations, including AntMaze Large Play/Diverse and Kitchen Mixed/Partial.
Anirudhan Badrinath · Allen Nie · Yannis Flet-Berliac · Emma Brunskill
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction (Poster)
Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call **Metric-Aware Abstraction** (**METRA**). Our main idea is, instead of directly covering the state space, to only cover a compact latent space $\mathcal{Z}$ that is *metrically* connected to the state space $\mathcal{S}$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the *first* unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and video are available at https://sites.google.com/view/metra0
Seohong Park · Oleh Rybkin · Sergey Levine
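In symbols, our reading of the abstract is the following schematic objective, with $\phi$ the encoder into the latent space and $z$ a latent skill direction (a paraphrase; see the paper for the exact formulation and how the constraint is enforced):

```latex
% Schematic statement of the METRA objective as we read the abstract
% (a paraphrase, not the authors' exact formulation):
\max_{\pi,\,\phi}\; \mathbb{E}\big[(\phi(s') - \phi(s))^{\top} z\big]
\quad \text{s.t.} \quad \|\phi(s) - \phi(s')\|_{2} \le 1
\;\; \text{for adjacent states } s, s'
```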
Efficient Value Propagation with the Compositional Optimality Equation (Poster)
Goal-Conditioned Reinforcement Learning (GCRL) is about learning to reach predefined goal states. GCRL in the real world is crucial for adaptive robotics. Existing GCRL methods, however, suffer from low sample efficiency and high cost of collecting real-world data. Here we introduce the Compositional Optimality Equation (COE) for a widely used class of deterministic environments in which the reward is obtained only upon reaching a goal state. COE represents a novel alternative to the standard Bellman Optimality Equation, leading to more sample-efficient update rules. The Bellman update combines the immediate reward and the bootstrapped estimate of the best next state. Our COE-based update rule, however, combines the best composition of two bootstrapped estimates reflecting an arbitrary intermediate subgoal state. In tabular settings, the new update rule guarantees convergence to the optimal value function exponentially faster than the Bellman update! COE can also be used to derive compositional variants of conventional (deep) RL. In particular, our COE-based version of DDPG is more sample-efficient than DDPG in a continuous grid world.
Piotr Piękos · Aditya Ramesh · Francesco Faccio · Jürgen Schmidhuber
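To make the contrast with the Bellman backup concrete, here is a hedged tabular sketch (our reading of the abstract, not the authors' code) for goal-reaching values $V(s, g)$ in a deterministic environment where the reward arrives only at the goal:

```python
# Tabular sketch contrasting the Bellman backup with a compositional backup
# through intermediate subgoals (our reading of the abstract, not the
# authors' code). V[(s, g)] estimates GAMMA ** (steps from s to g).
GAMMA = 0.99

def bellman_update(V, s, g, next_states):
    # One-step bootstrap: discount times the best successor's value.
    if s == g:
        V[(s, g)] = 1.0
    else:
        V[(s, g)] = GAMMA * max(V[(s2, g)] for s2 in next_states[s])

def compositional_update(V, s, g, states):
    # Compose through the best intermediate subgoal w: V(s,w) * V(w,g) can
    # propagate value across long horizons in far fewer updates.
    best = max(V[(s, w)] * V[(w, g)] for w in states)
    V[(s, g)] = max(V[(s, g)], best)
```

The compositional rule can double the horizon over which value has propagated in a single sweep, which matches the intuition behind the exponential speed-up claimed above.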
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Poster)
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL•E 2, is also effective for creating instruction-following sequential decision-making agents. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. All resources, including our model weights, training scripts, and evaluation tools are made available for further research.
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith
Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View (Poster)
Some reinforcement learning (RL) algorithms have the capability of recombining together pieces of previously seen experience to solve a task never seen before during training. This oft-sought property is one of the few ways in which dynamic programming based RL algorithms are considered different from supervised learning (SL) based RL algorithms. Yet, recent RL methods based on off-the-shelf SL algorithms achieve excellent results without an explicit mechanism for stitching; it remains unclear whether those methods forgo this important stitching property. This paper studies this question in the setting of goal-reaching problems. We show that the desirable stitching property corresponds to a form of generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen *together* in the training data. Our analysis shows that this sort of generalization is different from *i.i.d.* generalization. This connection between stitching and generalization reveals why we should not expect existing RL methods based on SL to perform stitching, even in the limit of large datasets and models. We experimentally validate this result on carefully constructed datasets. This connection suggests a simple remedy, the same remedy for improving generalization in supervised learning: data augmentation. We propose a naive *temporal* data augmentation approach and demonstrate that adding it to RL methods based on SL enables them to stitch together experience so that they succeed in navigating between states and goals unseen together during training.
Raj Ghugare · Matthieu Geist · Glen Berseth · Benjamin Eysenbach
Goal Misgeneralization as Implicit Goal Conditioning (Poster)
While many examples of goal misspecification have been dissected in the reinforcement learning literature, few works have focused on the relatively new goal misgeneralization. As goal misgeneralization often stems from underspecification, we explore a simple environment with some goals specifiable through explicit conditioning, and others not. We find that agents generally pursue a mixture of possible goals, and the choice of goal to pursue is often inexplicable. Nonetheless, we attempt an explanation of implicit goal conditioning -- wherein subtle environment features determine which goal is pursued -- and aim to understand which features induce pursuit of one goal over another.
Diego Dorn · Neel Alex · David Krueger
Bi-Directional Goal-Conditioning on Single Value Function for State Space Search Problems (Poster)
State space search problems have a binary (found/not found) reward system. In our work, we assume the ability to sample goal states and use these samples to define a forward task $(\tau^*)$ and a backward task $(\tau^{inv})$ derived from the original state space search task, ensuring more useful and learnable samples. Similar to Hindsight Relabelling, we define 'Foresight Relabelling' for reverse trajectories. We also use the agent's ability (from the policy function) to evaluate the reachability of intermediate states and use these states as goals for new sub-tasks. We group these tasks and sample-generation strategies and train a single goal-conditioned policy function (DQN) to learn all of these different tasks, calling it 'SRE-DQN' (Scrambler-Resolver-Explorer). Finally, we demonstrate the advantages of bi-directional goal-conditioning and knowledge of the goal state by evaluating our framework on classical goal-reaching tasks and comparing with existing concepts extended to our bi-directional setting.
Vihaan Akshaay Rajendiran · Yu-Xiang Wang · Lei Li
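For reference, standard hindsight relabelling, which the paper's 'Foresight Relabelling' mirrors on backward trajectories, can be sketched as follows (the variable names are ours):

```python
# Standard hindsight relabelling (sketch): replace a transition's intended
# goal with a state actually reached later in the trajectory, turning failed
# episodes into useful training data under a binary found/not-found reward.
import random

def hindsight_relabel(trajectory):
    # trajectory: list of (state, action, next_state) tuples
    relabeled = []
    for t, (s, a, s2) in enumerate(trajectory):
        future = random.randint(t, len(trajectory) - 1)
        goal = trajectory[future][2]          # an achieved future state
        reward = 1.0 if s2 == goal else 0.0   # binary goal-reaching reward
        relabeled.append((s, a, goal, reward, s2))
    return relabeled
```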
Causality of GCRL: Return to No Future? (Poster)
Recent work has demonstrated remarkable effectiveness of formulating Reinforcement Learning (RL) objectives as supervised learning problems. The primary motivation of goal-conditioned RL (GCRL) is to learn actions which maximize the conditional probability of achieving the desired return. In order to accomplish this, GCRL strives to estimate the conditional probability of actions ($A = a$) given states ($S = s$) and future rewards ($R = r$), which can be expressed as $P(a \mid s, r) = P(s, a, r)/P(s, r)$. Subsequently, the optimal action aims to maximize an estimate of $P(a \mid s, r)$. One critical insight missing in both empirical and theoretical work on GCRL pertains to the causality of incorporating information about the future into the training process. Selection bias is a fundamental issue in achieving valid causal inference. It occurs when units in the population are preferentially included or, more broadly, when conditioning on a collider variable. When conditioned on, colliders introduce spurious associations between variables that share a common descendant. This can lead to an agent learning a biased policy based on spurious associations. In this work, we make a first attempt at investigating an important question for safe and robust decision making: what are the causal limitations of GCRL algorithms, and do they result in learning a biased policy? We examine GCRL via experiments in complete (all variables known and measured) and incomplete (unknown and unmeasured variables exist) graphical models.
Ivana Malenica · Susan Murphy
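The collider effect at the heart of this argument is easy to reproduce numerically; the sketch below (ours, not the authors' experiment) conditions two independent variables on their common effect and observes a spurious association:

```python
# Numerical demonstration of collider bias (illustrative; not the authors'
# experiment): A and R are independent, but conditioning on their common
# effect C induces a spurious association between them.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100_000)
r = rng.normal(size=100_000)
c = a + r + 0.1 * rng.normal(size=100_000)   # collider: common effect of a, r

print(np.corrcoef(a, r)[0, 1])               # ~0.0: independent overall
mask = c > 1.0                               # condition on the collider
print(np.corrcoef(a[mask], r[mask])[0, 1])   # clearly negative: spurious
```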
Hierarchical Empowerment: Toward Tractable Empowerment-Based Skill Learning (Poster)
General purpose agents will require large repertoires of skills. Empowerment---the maximum mutual information between skills and states---provides a pathway for learning large collections of distinct skills, but mutual information is difficult to optimize. We introduce a new framework, Hierarchical Empowerment, that makes computing empowerment more tractable by integrating concepts from Goal-Conditioned Hierarchical Reinforcement Learning. Our framework makes two specific contributions. First, we introduce a new variational lower bound on mutual information that can be used to compute empowerment over short horizons. Second, we introduce a hierarchical architecture for computing empowerment over exponentially longer time scales. We verify the contributions of the framework in a series of simulated robotics tasks. In a popular ant navigation domain, our four level agents are able to learn skills that cover a surface area over two orders of magnitude larger than prior work.
Andrew Levy · Sreehari Rammohan · Alessandro Allievi · Scott Niekum · George Konidaris
Goal-Conditioned Predictive Coding for Offline Reinforcement Learning (Poster)
Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories. However, the benefits of performing sequence modeling on trajectory data are not yet clear. In this work, we investigate whether sequence modeling has the ability to condense trajectories into useful representations that enhance policy learning. To achieve this, we adopt a two-stage framework that first summarizes trajectories using sequence modeling techniques, and then leverages trajectory representations to learn a policy along with a desired goal. This design allows many existing supervised offline RL methods to be considered as specific instances of our framework. Within this framework, we introduce Goal-Conditioned Predictive Coding (GCPC), an approach that brings powerful trajectory representations and leads to performant policies. Through extensive empirical evaluations on AntMaze, FrankaKitchen and Locomotion environments, we observe that sequence modeling has a significant impact on some decision making tasks. In addition, we demonstrate that GCPC learns a goal-conditioned latent representation about the future trajectory, which enables competitive performance on all three benchmarks.
Zilai Zeng · Ce Zhang · Shijie Wang · Chen Sun
Does Hierarchical Reinforcement Learning Outperform Standard Reinforcement Learning in Goal-Oriented Environments? (Poster)
Hierarchical Reinforcement Learning (HRL) targets long-horizon decision-making problems by decomposing the task into a hierarchy of subtasks. There is a plethora of HRL works that can perform bottom-up temporal abstraction automatically while learning a hierarchical policy. In this study, we assess the performance of standard RL and HRL within a customizable 2D Minecraft domain with varying difficulty levels. We observed that without a priori knowledge, predefined subgoal structures, and well-shaped reward structures, HRL methods surprisingly do not outperform all standard RL methods in the 2D Minecraft domain. We also provide clues to elucidate the underlying reasons for this outcome, e.g., whether HRL methods, incorporating automatic temporal abstraction, can discover bottom-up action abstractions that match the intrinsic top-down task decomposition, often referred to as "goal-directed behavior" in goal-oriented environments.
Ziyan Luo · Yijie Zhang · Zhaoyue (Rebecca) Wang
Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models (Poster)
If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner utilizes its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves real robot control tasks involving novel objects, distractors, and even environments, both in the real world and in simulation.
Kevin Black · Mitsuhiko Nakamoto · Pranav Atreya · Homer Walke · Chelsea Finn · Aviral Kumar · Sergey Levine
Using Proto-Value Functions for Curriculum Generation in Goal-Conditioned RL (Poster)
In this paper, we investigate the use of Proto-Value Functions (PVFs) for measuring the similarity between tasks in the context of Curriculum Learning (CL). PVFs serve as a mathematical framework for generating basis functions for the state space of a Markov Decision Process (MDP). They capture the structure of the state space manifold and have been shown to be useful for value function approximation in Reinforcement Learning (RL). We show that even a few PVFs allow us to estimate the similarity between tasks. Based on this observation, we introduce a new algorithm called Curriculum Representation Policy Iteration (CRPI) that uses PVFs for CL, and we provide a proof of concept in a Goal-Conditioned Reinforcement Learning (GCRL) setting.
Henrik Metternich · Ahmed Hendway · Pascal Klink · Jan Peters · Carlo D'Eramo
Asymmetric Norms to Approximate the Minimum Action Distance (Poster)
This paper presents a state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where distances between pairs of embedded states correspond to the minimum number of actions needed to transition between them. Unlike previous methods, our approach incorporates an asymmetric norm parametrization, enabling accurate approximations of minimum action distances in environments with inherent asymmetry. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning. To validate our approach, we conduct empirical experiments on both symmetric and asymmetric environments. Our results show that our asymmetric norm parametrization performs comparably to symmetric norms in symmetric environments and surpasses symmetric norms in asymmetric environments.
Lorenzo Steccanella · Anders Jonsson
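One possible asymmetric parametrization (an illustration of the idea, not necessarily the authors' exact choice) keeps only the positive part of the coordinate-wise difference between embeddings, so that $d(x, y) \neq d(y, x)$:

```python
# One way to parametrize an asymmetric distance over embedding differences
# (an illustration of the idea, not necessarily the authors' exact choice):
# keeping only the positive part of each coordinate makes d(x, y) != d(y, x),
# matching environments where some transitions are harder to reverse.
import torch

def asymmetric_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Approximates the minimum number of actions needed to go from x to y.
    return torch.clamp(y - x, min=0.0).sum(dim=-1)

x, y = torch.tensor([0.0, 2.0]), torch.tensor([3.0, 0.0])
print(asymmetric_distance(x, y))  # tensor(3.)
print(asymmetric_distance(y, x))  # tensor(2.)  <- asymmetry
```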
Simple Data Sharing for Multi-Tasked Goal-Oriented Problems (Poster)
Many important sequential decision problems -- from robotics and games to logistics -- are multi-tasked and goal-oriented. In this work, we frame them as Contextual Goal Oriented (CGO) problems, a goal-reaching special case of the contextual Markov decision process. CGO is a framework for designing multi-task agents that can follow instructions (represented by contexts) to solve goal-oriented tasks. We show that the CGO problem can be systematically tackled using datasets that are commonly obtainable: an unsupervised interaction dataset of transitions and a supervised dataset of context-goal pairs. Leveraging the goal-oriented structure of CGO, we propose a simple data sharing technique that can provably solve CGO problems offline under natural assumptions on the datasets' quality. While an offline CGO problem is a special case of offline reinforcement learning (RL) with unlabelled data, running a generic offline RL algorithm here can be overly conservative since the goal-oriented structure of CGO is ignored. In contrast, our approach carefully constructs an augmented Markov decision process (MDP) to avoid introducing unnecessary pessimistic bias. In the experiments, we demonstrate our algorithm can learn near-optimal context-conditioned policies in simulated CGO problems, outperforming offline RL baselines.
Ying Fan · Jingling Li · Adith Swaminathan · Aditya Modi · Ching-An Cheng
Stabilizing Contrastive RL: Techniques for Robotic Goal Reaching from Offline Data (Poster)
Robotic systems that rely primarily on self-supervised learning have the potential to decrease the amount of human annotation and engineering effort required to learn control strategies. In the same way that prior robotic systems have leveraged self-supervised techniques from computer vision (CV) and natural language processing (NLP), our work builds on prior work showing that reinforcement learning (RL) itself can be cast as a self-supervised problem: learning to reach any goal without human-specified rewards or labels. Despite the seeming appeal, little (if any) prior work has demonstrated how self-supervised RL methods can be practically deployed on robotic systems. By first studying a challenging simulated version of this task, we discover design decisions about architectures and hyperparameters that increase the success rate by $2 \times$. These findings lay the groundwork for our main result: we demonstrate that a self-supervised RL algorithm based on contrastive learning can solve real-world, image-based robotic manipulation tasks, with tasks being specified by a single goal image provided after training.
Chongyi Zheng · Benjamin Eysenbach · Homer Walke · Patrick Yin · Kuan Fang · Russ Salakhutdinov · Sergey Levine
GROOT: Learning to Follow Instructions by Watching Gameplay Videos (Poster)
We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis.
Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian (Shawn) Ma · Anji Liu · Yitao Liang
Multi-Resolution Skill Discovery for Hierarchical Reinforcement Learning (Poster)
Learning abstract actions can be beneficial for goal-conditioned reinforcement learning. Offline discovery of primitives has effectively leveraged large static datasets in reinforcement learning. While using abstract skills has performed well, the agents usually lack finesse in motion. Humans and animals, in contrast, can learn motor skills at different levels of temporal resolution: fine-grained skills such as piano playing, or gross skills such as running. We propose a solution to the problem of representing multiple temporal resolutions to enhance skill abstraction. We do so by encoding multiple temporal resolutions of skills and through an appropriate choice mechanism learned by an actor-critic framework. Our work builds on top of the recent Director agent and shows improved performance. We evaluate the method on the DeepMind Control Suite task 'walker_walk', resulting in qualitative and quantitative performance gains.
Shashank Sharma · Vinay Namboodiri · Janina A. Hoffmann
An Investigation into Value-Implicit Pre-training for Task-Agnostic, Sample-Efficient Reinforcement Learning (Poster)
One of the primary challenges of learning a diverse set of robotic manipulation skills from raw sensory observations is to learn a universal reward function that can be used for unseen tasks. To address this challenge, a recent breakthrough called value-implicit pre-training (VIP) has been proposed. VIP provides a self-supervised pre-trained visual representation that exhibits the capability to generate dense and smooth reward functions for unseen robotic tasks. In this paper, we explore the feasibility of VIP's goal-conditioned reward specification with the goal of achieving task-agnostic, sample-efficient reinforcement learning (RL). Our investigation involves an evaluation of online RL by means of VIP-generated rewards instead of human-crafted reward signals on goal-image-specified robotic manipulation tasks from Meta-World under a highly limited interaction. We find that VIP's goal-conditioned reward specification, including task-agnostic inherent features, can accelerate online RL when used in conjunction with sparse task-completion rewards after policy pre-training on a handful of demonstrations via behavior cloning, rather than when used alone.
Samyeul Noh · Seonghyun Kim · Ingook Jang