Workshop
Generalization in Planning (GenPlan '23)
Pulkit Verma · Siddharth Srivastava · Aviv Tamar · Felipe Trevizan
Room 238 - 239
This workshop aims to bridge highly active but largely parallel research communities, addressing the problem of generalizable and transferable learning for all forms of sequential decision making (SDM), including reinforcement learning and AI planning. We expect this workshop to accelerate foundational innovation in SDM by synthesizing the best ideas for learning generalizable representations of learned knowledge and for reliably utilizing that knowledge across different sequential decision-making problems. NeurIPS presents an ideal, inclusive venue for dialog and technical interaction among researchers spanning the vast range of research communities that focus on these topics.
Schedule
Sat 6:15 a.m. - 6:20 a.m.
Opening Remarks (Remarks)
Sat 6:20 a.m. - 6:55 a.m.
Causal Dynamics Learning for Task-Independent State Abstraction (Invited Talk)
Learning dynamics models accurately is an important goal for Model-Based Reinforcement Learning (MBRL), but most MBRL methods learn a dense dynamics model which is vulnerable to spurious correlations and therefore generalizes poorly to unseen states. In this paper, we introduce Causal Dynamics Learning for Task-Independent State Abstraction (CDL), which first learns a theoretically grounded causal dynamics model that removes unnecessary dependencies between state variables and the action, thus generalizing well to unseen states. A state abstraction can then be derived from the learned dynamics, which not only improves sample efficiency but also applies to a wider range of tasks than existing state abstraction methods. Evaluated on two simulated environments and downstream tasks, both the dynamics model and the policies learned by the proposed method generalize well to unseen states, and the derived state abstraction improves sample efficiency compared to learning without it.
Peter Stone
Sat 6:55 a.m. - 7:05 a.m.
Learning Abstract World Models for Value-preserving Planning with Options (Contributed Talk)
General-purpose agents require fine-grained controls and rich sensory inputs to perform a wide range of tasks. However, this complexity often leads to intractable decision-making. Traditionally, agents are provided with task-specific action and observation spaces to mitigate this challenge, but this reduces autonomy. Instead, agents must be capable of building state-action spaces at the correct abstraction level from their sensorimotor experiences. We leverage the structure of a given set of temporally-extended actions to learn abstract Markov decision processes (MDPs) that operate at a higher level of temporal and state granularity. We characterize state abstractions necessary to ensure that planning with these skills, by simulating trajectories in the abstract MDP, results in policies with bounded value loss in the original MDP. We evaluate our approach in goal-based navigation environments that require continuous abstract states to plan successfully and show that abstract model learning improves the sample efficiency of planning and learning.
Rafael Rodriguez Sanchez · George Konidaris
Sat 7:05 a.m. - 7:15 a.m.
Reinforcement Learning with Augmentation Invariant Representation: A Non-contrastive Approach (Contributed Talk)
Data augmentation has proven to be an effective measure to improve generalization performance in reinforcement learning (RL). However, recent approaches directly use the augmented data to learn the value estimate or regularize the estimation, often ignoring the core essence that the model needs to learn that augmented data indeed represents the same state. In this work, we present RAIR: Reinforcement learning with Augmentation Invariant Representation, which disentangles the representation learning task from the RL task and aims to learn similar latent representations for the original observation and the augmented one. Our approach learns the representation of high-dimensional visual observations in a non-contrastive self-supervised way combined with the standard RL objective. In particular, RAIR gradually pushes the latent representation of an observation closer to the representation produced for the corresponding augmented observations. Thus, our agent is more robust to changes in the environment. We evaluate RAIR on all sixteen environments from the RL generalization benchmark Procgen. The experimental results indicate that RAIR outperforms other data augmentation-based approaches under the standard generalization evaluation protocol.
Nasik Muhammad Nafi · William Hsu
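A minimal sketch of the kind of non-contrastive alignment objective the abstract describes, added alongside a standard RL loss; the network sizes, stop-gradient placement, and loss weighting are illustrative assumptions, not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): a non-contrastive objective
# that pulls the latent of an augmented observation toward the latent of
# the original observation, trained jointly with the RL objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, obs):
        return self.net(obs)

def augmentation_invariance_loss(encoder, obs, augmented_obs):
    """Push representations of original and augmented views together.

    Stop-gradient on the original view keeps the target stable, as is
    common in non-contrastive self-supervision (an assumption here).
    """
    z = encoder(obs).detach()        # target: original observation
    z_aug = encoder(augmented_obs)   # online: augmented observation
    return 1.0 - F.cosine_similarity(z_aug, z, dim=-1).mean()

# Usage: total_loss = rl_loss + lambda_repr * augmentation_invariance_loss(...)
```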
Sat 7:15 a.m. - 7:25 a.m.
Explore to Generalize in Zero-Shot RL (Contributed Talk)
We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks such that it performs well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that learning a policy that explores the domain effectively is harder to memorize than a policy that maximizes reward for a specific task, and therefore we expect such learned behavior to generalize well; we indeed demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our Explore to Generalize algorithm (ExpGen) builds on this insight: We train an additional ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which are guaranteed to generalize and drive us to a novel part of the state space, where the ensemble may potentially agree again. Our approach achieves state-of-the-art performance on several tasks in the ProcGen challenge that have so far eluded effective generalization. For example, we demonstrate a success rate of 82% on the Maze task and 74% on Heist with 200 training levels.
Ev Zisselman · Itai Lavie · Daniel Soudry · Aviv Tamar
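A minimal sketch of the test-time rule the abstract describes (act with the reward ensemble when it agrees, otherwise explore); the interface names and the agreement threshold are assumptions:

```python
# Illustrative sketch (assumed interface, not the authors' code) of
# ExpGen-style action selection at test time with discrete actions.
import numpy as np

def expgen_act(obs, reward_agents, explore_agent, agree_frac=1.0):
    """reward_agents: list of policies mapping obs -> discrete action.
    explore_agent: a policy trained to explore, which generalizes better."""
    actions = [agent(obs) for agent in reward_agents]
    values, counts = np.unique(actions, return_counts=True)
    if counts.max() >= agree_frac * len(reward_agents):
        return values[counts.argmax()]   # ensemble agrees: exploit
    return explore_agent(obs)            # disagreement: explore instead
```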
Sat 7:25 a.m. - 8:00 a.m.
Value-Based Abstractions for Planning (Invited Talk)
As reinforcement learning continues to advance, the integration of efficient planning algorithms with powerful representation learning becomes crucial for solving long-horizon tasks. We address key challenges in planning, reward learning, and representation learning through the objective of learning value-based abstractions. We explore this idea via goal-conditioned reinforcement learning to learn generalizable value functions and action-free pre-training. By leveraging self-supervised reinforcement learning and efficient planning algorithms, these approaches collectively contribute to the advancement of decision-making systems capable of learning and adapting to diverse tasks in real-world environments.
Amy Zhang
Sat 8:00 a.m. - 8:30 a.m.
Coffee Break
Sat 8:30 a.m. - 9:05 a.m.
Learning General Policies and Sketches (Invited Talk)
Recent progress in deep learning and deep reinforcement learning (DRL) has been truly remarkable, yet two problems remain: structural policy generalization and policy reuse. The first is about getting policies that generalize in a reliable way; the second is about getting policies that can be reused and combined in a flexible, goal-oriented manner. The two problems are studied in DRL but only experimentally, and the results are not clear and crisp. In our work, we have tackled these problems in a slightly different way, developing languages for expressing general policies, and methods for learning them using combinatorial and DRL approaches. We have also developed languages for expressing and learning general subgoal structures (sketches) and hierarchical policies which are based on the notion of planning width. In the talk, I'll present the main ideas and results. This is joint work with Blai Bonet, Simon Ståhlberg, Dominik Drexler, and other members of the RLeap team.
Hector Geffner
Sat 9:05 a.m. - 9:15 a.m.
GOOSE: Learning Domain-Independent Heuristics (Contributed Talk)
We present three novel graph representations of planning tasks suitable for learning domain-independent heuristics using Graph Neural Networks (GNNs) to guide search. In particular, to mitigate the issues caused by large grounded GNNs we present the first method for learning domain-independent heuristics with only the lifted representation of a planning task. We also provide a theoretical analysis of the expressiveness of our models, showing that some are more powerful than STRIPS-HGN, the only other existing model for learning domain-independent heuristics. Our experiments show that our heuristics generalise to much larger problems than those in the training set, vastly surpassing STRIPS-HGN heuristics.
Dillon Chen · Felipe Trevizan · Sylvie Thiebaux
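Independent of the specific graph representation, a learned heuristic of this kind is typically consumed by a standard search loop. A minimal sketch of greedy best-first search guided by a learned heuristic; `is_goal`, `successors`, and `heuristic` are assumed callables, not part of the GOOSE codebase:

```python
# Illustrative sketch: using any learned heuristic (e.g., a GNN over a
# graph encoding of the planning task) to guide greedy best-first search.
import heapq

def greedy_best_first_search(initial_state, is_goal, successors, heuristic):
    """heuristic: state -> estimated cost-to-go (here, a learned model)."""
    frontier = [(heuristic(initial_state), 0, initial_state, [])]
    seen, counter = {initial_state}, 1  # counter breaks ties in the heap
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                heapq.heappush(frontier, (heuristic(next_state), counter,
                                          next_state, plan + [action]))
                counter += 1
    return None  # no plan found
```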
Sat 9:15 a.m. - 9:25 a.m.
Hierarchical Reinforcement Learning with AI Planning Models (Contributed Talk)
Deep Reinforcement Learning (DRL) has shown breakthroughs in solving challenging problems, such as pixel-based games and continuous control tasks. In complex environments, infusing prior domain knowledge is essential to achieve sample efficiency and generalization. Neuro-symbolic AI seeks systematic domain knowledge infusion into neural network-based learning, and existing neuro-symbolic approaches for sequential decision-making leverage hierarchical reinforcement learning (HRL) by infusing symbolically specified prior knowledge on desired trajectories. However, this requires finding symbolic solutions in RL environments before learning, and it is difficult to handle the divergence between unknown RL dynamics and prior knowledge. Such shortcomings result in loose and manual neuro-symbolic integration and degrade the generalization capability. In this paper, we integrate the options framework in HRL with an AI planning model to resolve these shortcomings and generalize beyond RL environments where pre-specified partial solutions are valid. Our approach defines options from AI planning operators by establishing the connection between the two transition systems in the options framework and the AI planning task. Then, we show an option policy learning method that integrates an AI planner and model-free DRL algorithms with intrinsic rewards, encouraging consistency between the two transition systems. We design a suite of MiniGrid environments that cover increasing levels of difficulty in exploration, where our empirical evaluation clearly shows the advantage of HRL with AI planning models.
Junkyu Lee · Michael Katz · Don Joven Agravante · Miao Liu · Geraud Nangue Tasse · Tim Klinger · Shirin Sohrabi Araghi
Sat 9:25 a.m. - 9:35 a.m.
Epistemic Exploration for Generalizable Planning and Learning in Non-Stationary Stochastic Settings (Contributed Talk)
Reinforcement Learning (RL) provides a convenient framework for sequential decision making when closed-form transition dynamics are unavailable and can frequently change. However, the high sample complexity of RL approaches limits their utility in the real world. This paper presents an approach that performs meta-level exploration in the space of models and uses the learned models to compute policies. Our approach interleaves learning and planning, allowing data-efficient, task-focused sample collection in the presence of non-stationarity. We conduct an empirical evaluation on benchmark domains and show that our approach significantly outperforms baselines in sample complexity and easily adapts to changing transition systems across tasks.
Rushang Karia · Pulkit Verma · Gaurav Vipat · Siddharth Srivastava
Sat 9:35 a.m. - 9:45 a.m.
POMRL: No-Regret Learning-to-Plan with Increasing Horizons (Contributed Talk)
We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.
Khimya Khetarpal · Claire Vernade · Brendan O'Donoghue · Satinder Singh · Tom Zahavy
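The single-task finding referenced above (plan with a longer horizon as the model estimate improves) suggests schedules of the following shape. This is a purely illustrative sketch, not the schedule derived in the paper:

```python
# Purely illustrative (not the paper's derived heuristic): a slowly
# increasing discount factor that grows with the amount of within-task
# data, capped below a maximum discount.
import math

def discount_schedule(num_samples: int, gamma_max: float = 0.99) -> float:
    """More data -> more trustworthy model -> plan with a longer horizon."""
    return gamma_max * (1.0 - 1.0 / math.sqrt(num_samples + 1))
```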
Sat 9:45 a.m. - 9:55 a.m.
A Theoretical Explanation of Deep RL Performance in Stochastic Environments (Contributed Talk)
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. We find that any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works with complex function approximators like neural networks, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice, since we show many environments can be solved by effectively estimating the random policy's Q-function and then applying zero or a few steps of value iteration. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead, which is typically much smaller than the full horizon, and in the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance.
Cassidy Laidlaw · Banghua Zhu · Stuart J Russell · Anca Dragan
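A minimal sketch of the loop the abstract describes: explore randomly, then run a few steps of fitted-Q iteration over the collected rollouts with any regressor that generalizes in-distribution. A classic Gym-style environment interface and a simplified initialization (immediate reward rather than the full random-policy Q-estimate) are assumptions here, not the authors' code:

```python
# Illustrative SQIRL-style sketch (assumed details, not the authors' code).
# Assumes env.reset() -> obs and env.step(a) -> (obs, reward, done, info),
# discrete actions, and a sklearn-like regressor with fit/predict.
import numpy as np

def sqirl(env, regressor_factory, num_rollouts=100, horizon=50,
          k_steps=2, gamma=0.99):
    # 1) Collect transitions with purely random exploration.
    data = []  # (state, action, reward, next_state) tuples
    for _ in range(num_rollouts):
        s = env.reset()
        for _ in range(horizon):
            a = env.action_space.sample()
            s2, r, done, _ = env.step(a)
            data.append((s, a, r, s2))
            if done:
                break
            s = s2
    # 2) A few steps of fitted-Q iteration over the random-exploration data.
    #    (Simplification: Q is initialized from immediate rewards.)
    features = np.array([np.append(s, a) for s, a, _, _ in data])
    q = regressor_factory().fit(features, np.array([r for _, _, r, _ in data]))
    actions = sorted({a for _, a, _, _ in data})
    for _ in range(k_steps):
        bootstrap = np.array([
            r + gamma * max(q.predict([np.append(s2, a2)])[0] for a2 in actions)
            for _, _, r, s2 in data])
        q = regressor_factory().fit(features, bootstrap)
    return q  # act greedily with respect to q at deployment
```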
Sat 9:55 a.m. - 11:30 a.m.
Lunch Break
Sat 11:30 a.m. - 12:05 p.m.
Logic, Automata, and Games in Linear Temporal Logics on Finite Traces (Invited Talk)
Temporal logics on finite traces (LTLf, LDLf, PPLTL, etc.) are increasingly attracting the interest of the scientific community. These logics are variants of temporal logics used for specifying dynamic properties in Formal Methods, but focussing on finite though unbounded traces. They are becoming popular in several areas, including AI planning for expressing temporally extended goals, reactive synthesis for automatically synthesizing interactive programs, reinforcement learning for expressing non-Markovian rewards and dynamics, and Business Process Modeling for declaratively specifying processes. These logics can express general safety and guarantee (reachability) properties, though, unlike more traditional temporal logics on infinite traces, they cannot talk about behaviors ad infinitum. The key characteristic of these logics is that they can be reduced to equivalent regular automata, and these automata, once determinized, can in turn be reduced to two-player games on graphs. This gives them unprecedented computational effectiveness and scalability. In this talk, we will look at these logics, their corresponding automata, and the resulting games, and show their relevance in service composition. In particular, we show how they can be used for automatically synthesizing orchestrators for advanced forms of goal-oriented synthesis.
Giuseppe De Giacomo
Sat 12:05 p.m. - 12:15 p.m.
Addressing Long-Horizon Tasks by Integrating Program Synthesis and State Machines (Contributed Talk)
Deep reinforcement learning excels in various domains but lacks generalizability and interpretability. Programmatic RL (Trivedi et al., 2021; Liu et al., 2023) methods reformulate solving RL tasks as synthesizing interpretable programs that can be executed in the environments. Despite encouraging results, these methods are limited to short-horizon tasks. On the other hand, representing RL policies using state machines (Inala et al., 2020) can inductively generalize to long-horizon tasks; however, it struggles to scale up to acquire diverse and complex behaviors and is difficult for human users to interpret. This work proposes Program Machine Policies (POMPs), which bridge the advantages of programmatic RL and state machine policies, allowing for representing complex behaviors and addressing long-horizon tasks. Specifically, we introduce a method that can retrieve a set of effective, diverse, compatible programs. Then, we use these programs as modes of a state machine and learn a transition function to transition among mode programs, allowing for capturing long-horizon repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and demonstrates the ability to inductively generalize to even longer horizons without any fine-tuning. Ablation studies justify the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.
Yu-An Lin · Chen-Tao Lee · Guan-Ting Liu · Pu-Jen Cheng · Shao-Hua Sun
Sat 12:15 p.m. - 12:25 p.m.
PADDLE: Logic Program Guided Policy Reuse in Deep Reinforcement Learning (Contributed Talk)
Learning new skills through previous experience is common in human life, and it is the core idea of Transfer Reinforcement Learning (TRL). This requires the agent to learn when and which source policy is best to reuse as the target task's policy, and how to reuse it. Most TRL methods learn, transfer, and reuse black-box policies, which makes it hard to explain 1) when to reuse and 2) which source policy is effective, and 3) reduces transfer efficiency. In this paper, we propose a novel TRL method called ProgrAm guiDeD poLicy rEuse (PADDLE) that can measure the logic similarities between tasks and transfer knowledge with interpretable cause-effect logic to the target task. To achieve this, we first propose a hybrid decision model that synthesizes high-level logic programs and learns a low-level DRL policy to learn multiple source tasks. Second, we estimate the logic similarity between the target task and the source tasks and combine it with the low-level policy similarity to select the appropriate source policy as the guiding policy for the target task. Experimental results show that our method can effectively select the appropriate source tasks to guide learning on the target task, outperforming black-box TRL methods.
Hao Zhang · Tianpei Yang · YAN ZHENG · Jianye Hao · Matthew Taylor
Sat 12:25 p.m. - 1:00 p.m.
Poster Session 1
Sat 1:00 p.m. - 1:30 p.m.
Coffee Break
Sat 1:30 p.m. - 2:00 p.m.
Poster Session 2
Sat 2:00 p.m. - 2:35 p.m.
In-Context Learning of Sequential Decision-Making Tasks (Invited Talk)
Training autonomous agents that can learn new tasks from only a handful of demonstrations is a long-standing problem in machine learning. Recently, transformers have been shown to learn new language or vision tasks without any weight updates from only a few examples, also referred to as in-context learning. However, the sequential decision-making setting poses additional challenges and has a lower tolerance for errors, since the environment's stochasticity or the agent's actions can lead to unseen, and sometimes unrecoverable, states. In this talk, I will show that naively applying transformers to this setting does not enable in-context learning of new tasks. I will then show how different design choices, such as the model size, data diversity, environment stochasticity, and trajectory burstiness, affect in-context learning of sequential decision-making tasks. Finally, I will show that by training on large diverse offline datasets, transformers are able to learn entirely new tasks with unseen states, actions, dynamics, and rewards, using only a handful of demonstrations and no weight updates. I will end my talk with a discussion of the limitations of offline learning approaches in sequential decision-making and some directions for future work.
Roberta Raileanu
Sat 2:35 p.m. - 2:45 p.m.
RL³: Boosting Meta Reinforcement Learning via RL inside RL² (Contributed Talk)
Meta reinforcement learning (meta-RL) methods such as RL² have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, these RL algorithms struggle with long-horizon tasks and out-of-distribution tasks since they rely on recurrent neural networks to process the sequence of experiences instead of summarizing them into general RL components such as value functions. Moreover, even transformers have a practical limit to the length of histories they can efficiently reason about before training and inference costs become prohibitive. In contrast, traditional RL algorithms are data-inefficient since they do not leverage domain knowledge, but they do converge to an optimal policy as more data becomes available. In this paper, we propose RL³, a principled hybrid approach that combines traditional RL and meta-RL by incorporating task-specific action-values learned through traditional RL as an input to the meta-RL neural network. We show that RL³ earns greater cumulative reward on long-horizon and out-of-distribution tasks compared to RL², while maintaining the efficiency of the latter in the short term. Experiments are conducted on both custom and benchmark discrete domains from the meta-RL literature that exhibit a range of short-term, long-term, and complex dependencies.
Abhinav Bhatia · Samer Nashed · Shlomo Zilberstein
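A minimal sketch of the hybrid input construction the abstract describes; the shapes, names, and the tabular Q-learner below are assumptions, not the authors' implementation:

```python
# Illustrative sketch: augmenting the input of a meta-RL policy network
# (as in RL^2) with task-specific action-value estimates maintained by a
# traditional RL algorithm such as Q-learning.
import numpy as np

class TabularQ:
    """Task-specific Q-estimates, updated online within the current task."""
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma = lr, gamma

    def update(self, s, a, r, s2):
        target = r + self.gamma * self.q[s2].max()
        self.q[s, a] += self.lr * (target - self.q[s, a])

def meta_rl_input(obs_onehot, prev_action_onehot, prev_reward, q_values):
    # RL^2 conditions on (obs, prev action, prev reward); the RL^3 idea is
    # to additionally append the current task's Q-estimates for this state.
    return np.concatenate([obs_onehot, prev_action_onehot,
                           [prev_reward], q_values])
```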
Sat 2:45 p.m. - 2:55 p.m.
Towards General-Purpose In-Context Learning Agents (Contributed Talk)
Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution.
Louis Kirsch · James Harrison · Daniel Freeman · Jascha Sohl-Dickstein · Jürgen Schmidhuber
Sat 2:55 p.m. - 3:25 p.m.
Panel Discussion (Panel)
Sat 3:25 p.m. - 3:30 p.m.
Closing Remarks (Remarks)
Posters

Massively Scalable Inverse Reinforcement Learning for Route Optimization (Poster)
Optimizing for humans’ latent preferences remains a grand challenge in route recommendation. Prior research has provided increasingly general methods based on inverse reinforcement learning (IRL), yet no approach has successfully addressed planetary-scale routing problems with hundreds of millions of states and demonstration trajectories. In this paper, we introduce scaling techniques based on graph compression, spatial parallelization, and improved initialization conditions inspired by a connection to eigenvector algorithms. We revisit classic IRL algorithms in the routing context, and make the key observation that there exists a trade-off between the use of cheap, deterministic planners and expensive yet robust stochastic policies. This insight is leveraged in Receding Horizon Inverse Planning (RHIP), a new generalization of classic IRL algorithms that provides fine-grained control over performance trade-offs via its planning horizon. Our contributions culminate in a policy that achieves a 16-24% improvement in route quality at a global scale, and to the best of our knowledge, represents the largest published benchmark of IRL algorithms in a real-world setting to date. We conclude by conducting an ablation study of key components, presenting negative results from alternative eigenvalue solvers, and identifying opportunities to further improve scalability via IRL-specific batching strategies.
Matt Barnes · Matthew Abueg · Oliver Lange · Matt Deeds · Jason Trader · Denali Molitor · Markus Wulfmeier · Shawn O'Banion
Reasoning with Language Model is Planning with World Model (Poster)
Large language models (LLMs) have shown remarkable reasoning capabilities, particularly with chain-of-thought (CoT) prompting. However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math or logical reasoning. The deficiency stems from the key fact that LLMs lack an internal world model to predict the world state (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, Reasoning via Planning (RAP). RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and rewards, and efficiently obtains a high-reward reasoning path with a proper balance between exploration vs. exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLaMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.
Shibo Hao · Yi Gu · Haodi Ma · Joshua Hong · Zhen Wang · Daisy Zhe Wang · Zhiting Hu
Robustness and Regularization in Reinforcement Learning (Poster)
Robust Markov decision processes (MDPs) tackle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization, which can significantly increase computational complexity and limit scalability. On the other hand, policy regularization improves learning stability without impairing time complexity. Yet, it does not encompass uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that policy regularization methods solve a particular instance of robust MDPs with uncertain rewards. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We then introduce twice regularized MDPs (R² MDPs), i.e., MDPs with value and policy regularization. The corresponding Bellman operators lead to planning and learning schemes with convergence and generalization guarantees, thus reducing robustness to regularization. We numerically show this two-fold advantage on tabular and physical domains, and illustrate the persistent efficacy of R² regularization.
Esther Derman · Yevgeniy Men · Matthieu Geist · Shie Mannor
Learning Generalizable Visual Task Through Interaction (Poster)
We present a framework for robots to learn novel visual concepts and visual tasks via in-situ linguistic interactions with human users. Previous approaches in computer vision have either used large pre-trained visual models to infer novel objects zero-shot, or added novel concepts along with their attributes and representations to a concept hierarchy. We extend the approaches that focus on learning visual concept hierarchies and take this ability one step further to demonstrate novel task solving on robots along with the learned visual concepts. To enable a visual concept learner to solve robotics tasks one-shot, we developed two distinct techniques. Firstly, we propose a novel approach, Hi-Viscont (HIerarchical VISual CONcept learner for Task), which augments information of a novel concept that is being taught to its parent nodes within a concept hierarchy. This information propagation allows all concepts in a hierarchy to update as novel concepts are taught in a continual learning setting. Secondly, we represent a visual task as a scene graph with language annotations, allowing us to create novel permutations of a demonstrated task zero-shot in-situ. We compared Hi-Viscont with the baseline model (FALCON; Mei et al., 2022) on visual question answering (VQA) in three domains. While being comparable to the baseline model on leaf-level concepts, Hi-Viscont achieves an improvement of over 9% on non-leaf concepts on average. Additionally, we provide a demonstration where a human user teaches the robot visual tasks and concepts interactively. With these results we demonstrate the ability of our model to learn tasks and concepts in a continual learning setting on the robot.
Weiwei Gu · Anant Sah · Nakul Gopalan
Non-adaptive Online Finetuning for Offline Reinforcement Learning (Poster)
Offline reinforcement learning (RL) has emerged as an important framework for applying RL to real-life applications. However, the complete lack of online interactions causes technical difficulties, and the online finetuning setting incorporates a limited form of online interaction, which is often available in practice, to address these challenges. Unfortunately, current theoretical frameworks for online finetuning either assume a high online sample complexity and/or require deploying fully adaptive algorithms (i.e., unlimited policy changes), which restricts their application to real-world settings where online interactions and policy updates are expensive and limited. In this paper, we develop a new framework for online finetuning. Instead of competing with the optimal policy (which inherits the high sample complexity and adaptivity requirements of online RL), we aim to learn a new policy that improves as much as possible over the existing policy using a pre-specified number of online samples and with a non-adaptive data-collection policy. Our formulation reveals surprising nuances and suggests novel principles that distinguish the finetuning problem from purely online and offline RL.
Audrey Huang · Mohammad Ghavamzadeh · Nan Jiang · Marek Petrik
Learning Interactive Real-World Simulators (Poster)
Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different axes (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, UniSim can emulate how humans and agents interact with the world by simulating the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x,y” from otherwise static scenes and objects. There are numerous use cases for such a real-world simulator. As an example, we use UniSim to train both high-level vision-language planners and low-level reinforcement learning policies, each of which exhibit zero-shot real-world transfer after training purely in a learned real-world simulator. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience in UniSim, opening up even wider applications.
Sherry Yang · Yilun Du · Kamyar Ghasemipour · Jonathan Tompson · Dale Schuurmans · Pieter Abbeel
Agent-Centric State Discovery for Finite-Memory POMDPs (Poster)
Discovering an informative, or agent-centric, state representation that encodes only the relevant information while discarding the irrelevant is a key challenge towards scaling reinforcement learning algorithms and efficiently applying them to downstream tasks. Prior works studied this problem in high-dimensional Markovian environments, when the current observation may be a complex object but is sufficient to decode the informative state. In this work, we consider the problem of discovering the agent-centric state in the more challenging high-dimensional non-Markovian setting, when the state can be decoded from a sequence of past observations. We establish that generalized inverse models can be adapted for learning agent-centric state representation for this task. Our results include asymptotic theory as well as negative results for alternative intuitive algorithms, such as encoding with only a forward-running sequence model. We complement these findings with a thorough empirical study on the agent-centric state discovery abilities of the different alternatives we put forward. Particularly notable is our analysis of past actions, where we show that these can be a double-edged sword: making the algorithms more successful when used correctly and causing dramatic failure when used incorrectly.
Lili Wu · Ben Evans · Riashat Islam · Raihan Seraj · Yonathan Efroni · Alex Lamb
Simple Data Sharing for Multi-Tasked Goal-Oriented Problems (Poster)
Many important sequential decision problems, from robotics and games to logistics, are multi-tasked and goal-oriented. In this work, we frame them as Contextual Goal Oriented (CGO) problems, a goal-reaching special case of the contextual Markov decision process. CGO is a framework for designing multi-task agents that can follow instructions (represented by contexts) to solve goal-oriented tasks. We show that the CGO problem can be systematically tackled using datasets that are commonly obtainable: an unsupervised interaction dataset of transitions and a supervised dataset of context-goal pairs. Leveraging the goal-oriented structure of CGO, we propose a simple data sharing technique that can provably solve CGO problems offline under natural assumptions on the datasets' quality. While an offline CGO problem is a special case of offline reinforcement learning (RL) with unlabelled data, running a generic offline RL algorithm here can be overly conservative since the goal-oriented structure of CGO is ignored. In contrast, our approach carefully constructs an augmented Markov Decision Process (MDP) to avoid introducing unnecessary pessimistic bias. In the experiments, we demonstrate our algorithm can learn near-optimal context-conditioned policies in simulated CGO problems, outperforming offline RL baselines.
Ying Fan · Jingling Li · Adith Swaminathan · Aditya Modi · Ching-An Cheng
Leveraging Behavioral Cloning for Representation Alignment in Cross-Domain Policy Transfer (Poster)
The limited transferability of learned policies is a major challenge that restricts the applicability of learning-based solutions in decision-making tasks. In this paper, we present a simple method for aligning latent state representations across different domains using unaligned trajectories of proxy tasks. Once the alignment process is completed, policies trained on the shared representation can be transferred to another domain without further interaction. Our key finding is that multi-domain behavioral cloning is a powerful means of shaping a shared latent space. We also observe that the commonly used domain discriminative objective for distribution matching can be overly restrictive, potentially disrupting the latent state structure of each domain. As an alternative, we propose to use maximum mean discrepancy for regularization. Since our method focuses on capturing shared structures, it does not require discovering the exact cross-domain correspondence that existing methods aim for. Furthermore, our approach involves training only a single multi-domain policy, making it easy to extend. We evaluate our method across various domain shifts, including cross-robot and cross-viewpoint settings, and demonstrate that our approach outperforms existing methods that employ adversarial domain translation. We also conduct ablation studies to investigate the effectiveness of each loss component for different domain shifts.
Hayato Watahiki · Ryo Iwase · Ryosuke Unno · Yoshimasa Tsuruoka
Understanding Representations Pretrained with Auxiliary Losses for Embodied Agent Planning (Poster)
Pretrained representations from large-scale vision models have boosted the performance of downstream embodied policy learning. We look to understand whether additional pretraining using common auxiliary losses in embodied AI can build on these general-purpose visual representations to better support planning in embodied tasks. We use a CLIP visual backbone and pretrain a visual compression module and the agent's state belief representations with four unsupervised auxiliary losses, two hindsight-based losses, and a standard imitation learning loss, on a fixed dataset of exploration trajectories. The learned representations are then frozen for downstream multi-step evaluation on two goal-directed tasks in realistic environments. Surprisingly, we find that imitation learning on these exploration trajectories outperforms all other auxiliary losses even despite the exploration trajectories being dissimilar from the downstream tasks. This suggests that imitation of exploration may be "all you need" for building powerful planning representations. Additionally, we find that simple alternatives of popular auxiliary losses can improve their support for downstream planning ability.
Yuxuan (Effie) Li · Luca Weihs
Contrastive Abstraction for Reinforcement Learning (Poster)
Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e., to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, contrastive abstraction learning does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.
Vihang Patil · Markus Hofmarcher · Elisabeth Rumetshofer · Sepp Hochreiter
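A minimal sketch of the second phase described above: mapping a (contrastively trained) state embedding to an abstract state via the modern Hopfield retrieval update. The stored patterns, inverse temperature, and iteration count are assumptions:

```python
# Illustrative sketch (assumes the embedding is already trained
# contrastively so temporally adjacent states are close): assigning a
# state embedding to an abstract state via a modern Hopfield network.
import numpy as np

def hopfield_retrieve(patterns, query, beta=8.0, n_iters=10):
    """patterns: (K, d) stored fixed points; query: (d,) state embedding.
    Iterates xi <- patterns^T softmax(beta * patterns @ xi) so the query
    settles near one of the K fixed points, i.e., an abstract state."""
    xi = query
    for _ in range(n_iters):
        logits = beta * patterns @ xi
        attn = np.exp(logits - logits.max())
        attn /= attn.sum()
        xi = patterns.T @ attn
    return int(np.argmax(patterns @ xi))  # index of the abstract state
```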
Work-in-Progress: Using Symbolic Planning with Deep RL to Improve Learning (Poster)
Deep Reinforcement Learning (DRL) has achieved impressive success across a wide range of domains. However, it still faces the sample-inefficiency problem, requiring massive numbers of training samples to learn the optimal policy. Furthermore, the trained policy is highly dependent on the training environment, which limits generalization. In this paper, we propose the Planning-guided RL (PRL) approach to explore how symbolic planning can help DRL in terms of efficiency and generalization. Our PRL is a two-level structure that incorporates any symbolic planner as the meta-controller to derive the subgoals. The low-level controller learns how to achieve the subgoals. We evaluate PRL on Montezuma's Revenge and results show that PRL outperforms previous hierarchical methods. The evaluation of generalization is a work-in-progress.
Tianpei Yang · Srijita Das · Christabel Wayllace · Matthew Taylor
Graph Neural Networks and Graph Kernels For Learning Heuristics: Is there a difference? (Poster)
Graph neural networks (GNNs) have been used in various works for learning heuristics to guide search for planning. However, they are hindered by their slow evaluation speed and their limited expressiveness. It is also a known fact that the expressiveness of common GNNs is bounded by the Weisfeiler-Lehman (WL) algorithm for testing graph isomorphism, with which one can generate shallow embeddings for graphs. Thus, one may ask how GNNs compare against statistical machine learning models such as linear regression and kernel methods with WL features of planning problems represented as graphs. Our experiments show that simple linear regression is at least as competitive as GNN models for learning heuristics for planning in the learning track of the recent 2023 International Planning Competition (IPC). Most notably, our models train in under a minute and achieve performance rivalling models which train for up to 24 hours. We also discuss prevalent issues and open questions in the field of automated learning for planning which need to be solved in order for the field to progress.
Dillon Chen · Felipe Trevizan · Sylvie Thiebaux
Learning How to Create Generalizable Hierarchies for Robot Planning (Poster)
This paper addresses the problem of inventing and using hierarchical representations for stochastic robot-planning problems. Rather than using hand-coded state or action representations as input, it presents new methods for learning how to create a generalizable high-level action representation for long-horizon, sparse-reward robot planning problems in stochastic settings with unknown dynamics. After training, this system yields a robot-class-specific but environment-independent planning system that generalizes to different robots, environments, and problem instances. Given new problem instances in unseen stochastic environments, it first creates zero-shot options (without any experience on the new environment) with dense pseudo-rewards and then uses them to solve the input problem in a hierarchical planning and refinement process. Theoretical results identify sufficient conditions for completeness of the presented approach. Extensive empirical analysis shows that even in settings that go beyond these sufficient conditions, this approach convincingly outperforms baselines by 2× in terms of solution time with orders of magnitude improvement in solution quality.
Naman Shah · Siddharth Srivastava
Plansformer: Generating Symbolic Plans using Transformers (Poster)
Large Language Models (LLMs) have been the subject of active research, significantly advancing the field of Natural Language Processing (NLP). From BERT to BLOOM, LLMs have surpassed state-of-the-art results in various natural language tasks such as question answering, summarization, and text generation. Many ongoing efforts focus on understanding LLMs' capabilities, including their knowledge of the world, syntax, and semantics. However, extending the textual prowess of LLMs to symbolic reasoning has been slow and predominantly focused on tackling problems related to the mathematical field. In this paper, we explore the use of LLMs for automated planning - a branch of AI concerned with the realization of action sequences (plans) to achieve a goal, typically executed by intelligent agents, autonomous robots, and unmanned vehicles. We introduce Plansformer, an LLM fine-tuned on planning problems and capable of generating plans with favorable behavior in terms of correctness and length with reduced knowledge-engineering efforts. We also demonstrate the adaptability of Plansformer in solving different planning domains with varying complexities, owing to the transfer learning abilities of LLMs. For one configuration of Plansformer, we achieve ~97% valid plans, out of which ~95% are optimal for Towers of Hanoi - a puzzle-solving domain.
Vishal Pallagani · Bharath Muppasani · Keerthiram Murugesan · Francesca Rossi · Lior Horesh · Biplav Srivastava · Francesco Fabiano · Andrea Loreggia
Learning Task Embeddings for Teamwork Adaptation in Multi-Agent Reinforcement Learning (Poster)
Successful deployment of multi-agent reinforcement learning often requires agents to adapt their behaviour. In this work, we discuss the problem of teamwork adaptation in which a team of agents needs to adapt their policies to solve novel tasks with limited fine-tuning. Motivated by the intuition that agents need to be able to identify and distinguish tasks in order to adapt their behaviour to the current task, we propose to learn multi-agent task embeddings (MATE). These task embeddings are trained using an encoder-decoder architecture optimised for reconstruction of the transition and reward functions which uniquely identify tasks. We show that a team of agents is able to adapt to novel tasks when provided with task embeddings. We propose three MATE training paradigms: independent MATE, centralised MATE, and mixed MATE which vary in the information used for the task encoding. We show that the embeddings learned by MATE identify tasks and provide useful information which agents leverage during adaptation to novel tasks.
Lukas Schäfer · Filippos Christianos · Amos Storkey · Stefano Albrecht
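A minimal sketch of the encoder-decoder idea described above: a recurrent encoder summarizes experience into a task embedding, and a decoder is trained to reconstruct next observations and rewards, which uniquely identify the task. Sizes and wiring are assumptions, not the MATE implementation:

```python
# Illustrative sketch (assumed architecture, not the authors' code).
import torch
import torch.nn as nn

class TaskEmbedder(nn.Module):
    def __init__(self, obs_dim, act_dim, embed_dim=16, hidden=64):
        super().__init__()
        # Encoder consumes (obs, action, reward, next_obs) transitions.
        self.encoder = nn.GRU(obs_dim + act_dim + 1 + obs_dim, hidden,
                              batch_first=True)
        self.to_embed = nn.Linear(hidden, embed_dim)
        # Decoder predicts next observation and reward from (obs, act, task).
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim + 1))

    def forward(self, transitions, obs, act):
        # transitions: (batch, T, obs+act+1+obs) experience collected so far
        _, h = self.encoder(transitions)
        z = self.to_embed(h[-1])                       # task embedding
        pred = self.decoder(torch.cat([obs, act, z], dim=-1))
        next_obs_pred, reward_pred = pred[..., :-1], pred[..., -1]
        return z, next_obs_pred, reward_pred  # train with reconstruction loss
```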
Towards More Likely Models for AI Planning (Poster)
This is the first work to look at the application of large language models (LLMs) for the purpose of model space edits in automated planning tasks. To set the stage for this sangam, we start by enumerating the different flavors of model space problems that have been studied so far in the AI planning literature and explore the effect of an LLM on those tasks with detailed illustrative examples. We also empirically demonstrate how the performance of an LLM contrasts with combinatorial search (CS), an approach that has traditionally been used to solve model space tasks in planning, both with the LLM in the role of a standalone model space reasoner and in the role of a statistical modeling tool in concert with the CS approach as part of a two-stage process. Our experiments show promising results suggesting further forays of LLMs into the exciting world of model space reasoning for planning tasks in the future.
Turgay Caglar · Sirine Belhaj · Tathagata Chakraborti · Michael Katz · Sarath Sreedharan
Learning AI-System Capabilities under Stochasticity (Poster)
Learning interpretable, generalizable models of sequential decision-making agents is essential for user-driven assessment as well as for continual agent-design processes in several AI applications. Discovering an agent's broad capabilities in terms of concepts a user understands and summarizing them for a user is a comparatively new solution approach for agent assessment. Prior work on this topic focuses on deterministic settings, settings where the names of the agent's capabilities are already known, or situations where the learning system has access only to passively collected data regarding the agent's behavior. These settings result in a limited scope and/or accuracy of the learned models. This paper presents an approach for discovering a black-box sequential decision making agent's capabilities and interactively learning an interpretable model of the agent in stochastic settings. Our approach uses an initial set of observations to discover the agent's capabilities and a hierarchical querying process to learn a probability distribution of the discovered stochastic capabilities. Our evaluation demonstrates that our method learns lifted SDM models with complex capabilities accurately.
Pulkit Verma · Rushang Karia · Gaurav Vipat · Anmol Gupta · Siddharth Srivastava
Contextual Pre-Planning on Reward Machine Abstractions for Enhanced Transfer in Deep Reinforcement Learning (Poster)
Recent studies show that deep reinforcement learning (DRL) agents tend to overfit to the task on which they were trained and fail to adapt to minor environment changes. To expedite learning when transferring to unseen tasks, we propose a novel approach to representing the current task using reward machines (RM), state machine abstractions that induce subtasks based on the current task’s rewards and dynamics. Our method provides agents with symbolic representations of optimal transitions from their current abstract state and rewards them for achieving these transitions. These representations are shared across tasks, allowing agents to exploit knowledge of previously encountered symbols and transitions, thus enhancing transfer. Our empirical evaluation shows that our representations improve sample efficiency and few-shot transfer in a variety of domains.
Guy Azran · Mohamad Hosein Danesh · Stefano Albrecht · Sarah Keren
Exploiting Contextual Structure to Generate Useful Auxiliary Tasks (Poster)
Reinforcement learning requires interaction with environments, which can be prohibitively expensive, especially in robotics. This constraint necessitates approaches that work with limited environmental interaction by maximizing the reuse of previous experiences. We propose an approach that maximizes experience reuse while learning to solve a given task by generating and simultaneously learning useful auxiliary tasks. To generate these tasks, we construct an abstract temporal logic representation of the given task and leverage large language models to generate context-aware object embeddings that facilitate object replacements. Counterfactual reasoning and off-policy methods allow us to simultaneously learn these auxiliary tasks while solving the given target task. We combine these insights into a novel framework for multitask reinforcement learning and experimentally show that our generated auxiliary tasks share similar underlying exploration requirements as the given task, thereby maximizing the utility of directed exploration. Our approach allows agents to automatically learn additional useful policies without extra environment interaction.
Benedict Quartey · Ankit Shah · George Konidaris
Normalization Enhances Generalization in Visual Reinforcement Learning (Poster)
Recent advances in visual reinforcement learning (RL) have led to impressive success in handling complex tasks. However, these methods have demonstrated limited generalization capability to visual disturbances, which poses a significant challenge for their real-world application and adaptability. Though normalization techniques have demonstrated huge success in supervised and unsupervised learning, their applications in visual RL are still scarce. In this paper, we explore the potential benefits of integrating normalization into visual RL methods with respect to generalization performance. We find that, perhaps surprisingly, incorporating suitable normalization techniques is sufficient to enhance the generalization capabilities, without any additional special design. We utilize the combination of two normalization techniques, CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are conducted on DMControl Generalization Benchmark and CARLA to validate the effectiveness of our method. We show that our method significantly improves generalization capability while only marginally affecting sample efficiency. In particular, when integrated with DrQ-v2, our method enhances the test performance of DrQ-v2 on CARLA across various scenarios, from 14% of the training performance to 97%.
Lu Li · Jiafei Lyu · Guozheng Ma · Zilin Wang · Zhenjie Yang · Xiu Li · Zhiheng Li
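Of the two techniques named above, CrossNorm has a simple commonly described form: swap channel-wise feature statistics between two instances as a feature-space augmentation. The sketch below is a simplified rendering of that idea, not the paper's code; SelfNorm (a learned recalibration of statistics) is not shown:

```python
# Illustrative CrossNorm-style sketch (simplified, assumed form).
import torch

def crossnorm(x_a, x_b, eps=1e-5):
    """x_a, x_b: feature maps of shape (C, H, W) from two instances.
    Returns x_a re-styled with x_b's channel-wise statistics."""
    mu_a = x_a.mean(dim=(1, 2), keepdim=True)
    std_a = x_a.std(dim=(1, 2), keepdim=True) + eps
    mu_b = x_b.mean(dim=(1, 2), keepdim=True)
    std_b = x_b.std(dim=(1, 2), keepdim=True) + eps
    return (x_a - mu_a) / std_a * std_b + mu_b
```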
Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning (Poster)
Exploration in sparse-reward reinforcement learning is difficult due to the need for long, coordinated sequences of actions in order to achieve any reward. Moreover, in continuous action spaces there are an infinite number of possible actions, which only increases the difficulty of exploration. One class of methods designed to address these issues forms temporally extended actions, often called skills, from interaction data collected in the same domain, and optimizes a policy on top of this new action space. Typically such methods require a lengthy pretraining phase, especially in continuous action spaces, in order to form the skills before reinforcement learning can begin. Given prior evidence that the full range of the continuous action space is not required in such tasks, we propose a novel approach to skill-generation with two components. First we discretize the action space through clustering, and second we leverage a tokenization technique borrowed from natural language processing to generate temporally extended actions. Such a method outperforms baselines for skill-generation in several challenging sparse-reward domains, and requires orders-of-magnitude less computation in skill-generation and online rollouts.
David Yunis · Justin Jung · Falcon Dai · Matthew Walter
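A minimal sketch of the two components described above: discretize continuous actions by clustering, then apply byte-pair-encoding-style merges over the resulting action-token sequences to form temporally extended skills. The cluster count and merge count are assumptions, not the paper's settings:

```python
# Illustrative sketch (assumed hyperparameters, not the authors' code).
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def build_skills(demo_actions, n_tokens=16, n_merges=32):
    """demo_actions: list of (T_i, action_dim) arrays from interaction data."""
    # 1) Discretize the continuous action space by clustering.
    all_actions = np.concatenate(demo_actions)
    kmeans = KMeans(n_clusters=n_tokens, n_init=10).fit(all_actions)
    corpus = [tuple(kmeans.predict(traj)) for traj in demo_actions]
    # 2) BPE-style merges: repeatedly fuse the most frequent adjacent pair.
    skills = []
    for _ in range(n_merges):
        pairs = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        skills.append((a, b))  # new temporally extended action
        merged = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append((a, b)); i += 2
                else:
                    out.append(seq[i]); i += 1
            merged.append(tuple(out))
        corpus = merged
    return kmeans, skills
```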
-
|
Contrastive Representations Make Planning Easy
(
Poster
)
>
link
Probabilistic inference over time series data is challenging when observations are high-dimensional. In this paper, we show how inference questions relating to prediction and planning can have compact, closed-form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learning to time series data. Prior work already shows that the representations learned by contrastive learning encode a probability ratio. By first extending this analysis to show that the marginal distribution over representations is Gaussian, we can then prove that the conditional distribution of future representations is also Gaussian. Taken together, these results show that a variant of temporal contrastive learning results in representations distributed according to a Gaussian Markov chain, a graphical model where inference (e.g., filtering, smoothing) has closed-form solutions. For example, in one special case the problem of trajectory inference simply corresponds to linear interpolation of the initial and final state representations. We provide brief empirical results validating our theory. |
Benjamin Eysenbach · Vivek Myers · Sergey Levine · Russ Salakhutdinov 🔗 |
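The closed-form special case mentioned at the end is simple enough to sketch: if representations follow a Gaussian Markov chain, the posterior mean of intermediate representations lies on the line between the endpoint encodings. The encoder below is a stand-in random projection, purely for illustration.

```python
import numpy as np

# Sketch: trajectory inference as linear interpolation in representation
# space. A real system would use the learned contrastive encoder here.

rng = np.random.default_rng(0)
encode = lambda obs, W=rng.normal(size=(64, 8)) / 8.0: obs @ W  # stand-in encoder

obs_start, obs_goal = rng.normal(size=64), rng.normal(size=64)
z0, zT = encode(obs_start), encode(obs_goal)

T = 10
waypoints = [(1 - t / T) * z0 + (t / T) * zT for t in range(T + 1)]
```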
-
|
Inverse Reinforcement Learning with Multiple Planning Horizons
(
Poster
)
>
link
In this work, we study an inverse reinforcement learning (IRL) problem where the experts are planning under a shared reward function but with different planning horizons. Without knowledge of the discount factors, the reward function has a larger feasible solution set, which makes the reward function harder to identify. To overcome this challenge, we develop an algorithm that, in practice, can learn a reward function similar to the true reward function. We give an empirical characterization of the identifiability and generalizability of the feasible solution set of the reward function. |
Jiayu Yao · Finale Doshi-Velez · Barbara Engelhardt 🔗 |
-
|
Stochastic Safe Action Model Learning
(
Poster
)
>
link
Hand-crafting models of interactive domains is challenging, especially when the dynamics of the domain are stochastic. Therefore, it is useful to be able to learn such models automatically instead. In this work, we propose an algorithm to learn stochastic planning models where the distribution over the sets of effects for each action has a small support, but the sets may assign values to an arbitrary number of state attributes (a.k.a. fluents). This class captures the benchmark domains used in stochastic planning, in contrast to prior work that assumed independence of the effects on individual fluents. Our algorithm has polynomial time and sample complexity when the size of the support is bounded by a constant. Importantly, our learning is safe in that we learn offline from example trajectories, and we guarantee that actions are only permitted in states where our model of the dynamics is guaranteed to be accurate. Moreover, we guarantee approximate completeness of the model, in the sense that if the examples are achieving goals from some distribution, then with high probability there will exist plans in our learned model that achieve goals from the same distribution. |
Zihao Deng · Brendan Juba 🔗 |
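A minimal sketch of the statistical core of this setup, tallying effect sets per action from offline trajectories, is below; the safety and PAC-completeness machinery of the paper is omitted, and the toy domain is made up. With a small support, the empirical distribution over effect sets converges quickly.

```python
from collections import Counter, defaultdict

# Sketch: states are frozensets of true fluents; an "effect" of an action
# is the (added, deleted) pair of fluent sets observed in a transition.

effect_counts = defaultdict(Counter)

def record(action, pre, post):
    added = frozenset(post - pre)
    deleted = frozenset(pre - post)
    effect_counts[action][(added, deleted)] += 1

def effect_distribution(action):
    counts = effect_counts[action]
    total = sum(counts.values())
    return {eff: n / total for eff, n in counts.items()}

# one made-up trajectory: (pre-state, action, post-state)
trajectory = [
    (frozenset({"dry"}), "paint", frozenset({"wet"})),
    (frozenset({"dry"}), "paint", frozenset({"dry"})),   # stochastic failure
    (frozenset({"wet"}), "wait",  frozenset({"dry"})),
]
for pre, act, post in trajectory:
    record(act, pre, post)

print(effect_distribution("paint"))  # two effect sets, each with prob 0.5
```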
-
|
Learning Discrete Models for Classical Planning Problems
(
Poster
)
>
link
For many sequential decision making domains, planning is often necessary to solve problems. However, for domains such as those encountered in robotics, the transition function, also known as the model, is often unknown, and coding such a model by hand is impractical. While planning could be done with a model trained from observed transitions, such approaches are limited by errors accumulating when the model is applied across many timesteps, as well as by the inability to reidentify states. Furthermore, even given an accurate model, domain-independent planning methods may not be able to reliably solve problems, while domain-specific information, such as informative heuristics, may not be available. While domain-independent methods exist that can learn domain-specific heuristic functions, such as DeepCubeA, these methods may assume a pre-determined goal. To solve these problems, we introduce DeepCubeAI, a domain-independent algorithm that learns a model that operates in a discrete latent space, learns a heuristic function that generalizes over start and goal states using this learned model, and combines the learned model and learned heuristic function with search to solve problems. Since the latent space is discrete, we can prevent the accumulation of small errors by rounding, and we can reidentify states by simply comparing two binary vectors. In our experiments on a pixel representation of the Rubik's cube and Sokoban, we find that DeepCubeAI is able to apply the model for thousands of steps without accumulating any error. Furthermore, DeepCubeAI solves over 99% of test instances in all domains and generalizes across goal states. |
Forest Agostinelli · Misagh Soltani 🔗 |
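The rounding argument is easy to see in isolation. The sketch below uses a stand-in transition model (the real one is a learned network) to show how rounding to a binary code after every step blocks error accumulation, and how states are re-identified by exact comparison of binary vectors.

```python
import numpy as np

# Sketch: a discrete latent space stops small model errors from compounding,
# because every prediction is snapped back onto {0, 1}^D.

rng = np.random.default_rng(0)
D = 32
W = rng.normal(size=(D, D)) * 0.05        # stand-in "learned" transition model

def step_model(z_binary):
    logits = z_binary @ W                  # noisy continuous prediction
    return (logits > 0.0).astype(np.float64)  # round back to {0, 1}

z = (rng.normal(size=D) > 0).astype(np.float64)
for _ in range(1000):                      # thousands of steps, no drift
    z = step_model(z)
assert set(np.unique(z)) <= {0.0, 1.0}

z_other = z.copy()
same_state = bool((z == z_other).all())    # re-identification by exact match
```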
-
|
Multi-Agent Learning of Efficient Fulfilment and Routing Strategies in E-Commerce
(
Poster
)
>
link
This paper presents an integrated algorithmic framework for minimising product delivery costs in e-commerce (known as the cost-to-serve or C2S). One of the major challenges in e-commerce is the large volume of spatio-temporally diverse orders from multiple customers, each of which has to be fulfilled from one of several warehouses using a fleet of vehicles. This results in two levels of decision-making: (i) selection of a fulfillment node for each order (including the option of deferral to a future time), and then (ii) routing of vehicles (each of which can carry multiple orders originating from the same warehouse). We propose an approach that combines graph neural networks and reinforcement learning to train the node selection and vehicle routing agents. We include real-world constraints such as warehouse inventory capacity, vehicle characteristics such as travel times, service times, carrying capacity, and customer constraints including time windows for delivery. The complexity of this problem arises from the fact that outcomes (rewards) are driven both by the fulfillment node mapping as well as the routing algorithms, and are spatio-temporally distributed. Our experiments show that this algorithmic pipeline outperforms pure heuristic policies. |
Omkar Shelke · Pranavi Pathakota · Anandsingh Chauhan · Hardik Meisheri · Harshad Khadilkar · Balaraman Ravindran 🔗 |
-
|
Integrating Planning and Deep Reinforcement Learning via Automatic Induction of Task Substructures
(
Poster
)
>
link
Despite recent advancements, deep reinforcement learning (DRL) still struggles at learning sparse-reward goal-directed tasks. Classical planning, on the other hand, excels at addressing hierarchical tasks by employing symbolic knowledge, yet most of these methods rely on assumptions about pre-defined subtasks, making them inapplicable to problems without domain knowledge or models. To combine the best of both worlds, we propose a framework that integrates DRL with classical planning by automatically inducing task structures and substructures from a few demonstrations. Specifically, symbolic regression is used for substructure induction by adopting genetic programming, where the program model reflects prior domain knowledge of effect rules. We compare the proposed framework to state-of-the-art DRL algorithms, imitation learning methods, and an exploration approach in various domains. Experimental results on various tasks show that our proposed framework outperforms all of the aforementioned algorithms in terms of sample efficiency and task performance. Moreover, our framework achieves strong generalization performance by effectively inducing new rules and composing task structures. Ablation studies justify the design of our induction module and the proposed genetic programming procedure. |
Jung-Chun Liu · Chi-Hsien Chang · Shao-Hua Sun · Tian-Li Yu 🔗 |
-
|
Learning Generalizable Symbolic Options for Transfer in Reinforcement Learning
(
Poster
)
>
link
This paper presents a new approach for Transfer Reinforcement Learning (RL) for Stochastic Shortest Path (SSP) problems in factored domains with unknown transition functions. We take as input a set of problem instances with sparse reward functions. The presented approach first learns a semantically well-defined state abstraction and then uses this abstraction to invent high-level options, learn abstract policies for executing them, and create symbolic representations for them. Given a new problem instance, our overall approach conducts a novel bi-directional search over the learned option representations while also inventing new options as needed. Our main contributions are approaches for continually learning transferable, generalizable knowledge in the form of symbolically represented options, as well as for integrating search techniques with RL to solve new problems by efficiently composing the learned options. Empirical results show that the resulting approach effectively transfers learned knowledge and achieves superior sample efficiency compared to SOTA methods. |
Rashmeet Kaur Nayyar · Shivanshu Verma · Siddharth Srivastava 🔗 |
-
|
Inductive Generalization in Reinforcement Learning from Specifications
(
Poster
)
>
link
Reinforcement Learning (RL) from logical specifications is a promising approach to learning control policies for complex long-horizon tasks. While these algorithms showcase remarkable scalability and efficiency in learning, a persistent hurdle lies in their limited ability to generalize the policies they generate. In this work, we present an inductive framework to improve policy generalization from logical specifications. We observe that logical specifications can be used to define a class of inductive tasks known as repeated tasks. These are tasks with similar overarching goals that differ inductively in low-level predicates and distributions. Hence, policies for repeated tasks should also be inductive. To this end, we present a compositional approach that learns policies for unseen repeated tasks by training on only a few repeated tasks. Our approach is evaluated on challenging control benchmarks with continuous state and action spaces, showing promising results in handling long-horizon tasks with improved generalization. |
Rohit kushwah · Vignesh Subramanian · Suguman Bansal · Subhajit Roy 🔗 |
-
|
MERMAIDE: Learning to Align Learners using Model-Based Meta-Learning
(
Poster
)
>
link
We study how a principal can efficiently and effectively intervene on the rewards of a previously unseen *learning* agent in order to induce desirable outcomes. This is relevant to many real-world settings like auctions or taxation, where the principal may not know the learning behavior nor the rewards of real people. Moreover, the principal should be few-shot adaptable and minimize the number of interventions, because interventions are often costly. We introduce MERMAIDE, a model-based meta-learning framework to train a principal that can quickly adapt to out-of-distribution agents with different learning strategies and reward functions. We validate this approach step-by-step. First, in a Stackelberg setting with a best-response agent, we show that meta-learning enables quick convergence to the theoretically known Stackelberg equilibrium at test time, although noisy observations severely increase the sample complexity. We then show that our model-based meta-learning approach is cost-effective in intervening on bandit agents with unseen explore-exploit strategies. Finally, we outperform baselines that use either meta-learning or agent behavior modeling, in both $0$-shot and $1$-shot settings with partial agent information.
|
Arundhati Banerjee · Soham Phade · Stefano Ermon · Stephan Zheng 🔗 |
-
|
Modeling Boundedly Rational Agents with Latent Inference Budgets
(
Poster
)
>
link
We study the problem of modeling a population of agents pursuing unknown goals subject to unknown computational constraints. In standard models of bounded rationality, sub-optimal decision-making is simulated by adding homoscedastic noise to optimal decisions rather than actually simulating constrained inference. In this work, we introduce a latent inference budget model (L-IBM) that models these constraints explicitly, via a latent variable (inferred jointly with a model of agents’ goals) that controls the runtime of an iterative inference algorithm. L-IBMs make it possible to learn agent models using data from diverse populations of suboptimal actors. In three modeling tasks—inferring navigation goals from routes, inferring communicative intents from human utterances, and predicting next moves in human chess games—we show that L-IBMs match or outperform Boltzmann models of decision-making under uncertainty. Moreover, the inferred inference budgets are themselves meaningful, efficient to compute, and correlated with measures of player skill, partner skill, and task difficulty. |
Athul Jacob · Abhishek Gupta · Jacob Andreas 🔗 |
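A toy rendering of the latent-budget idea, written for this summary rather than taken from the paper: on a small chain MDP with a distracting nearby reward, an agent that runs only k sweeps of value iteration acts myopically, and Bayes' rule over k recovers how much inference the observed actions imply.

```python
import numpy as np

# Assumption-laden sketch, not the paper's L-IBM: the latent variable k is
# the number of value-iteration sweeps the agent runs before acting greedily.

n_states, gamma = 6, 0.9
R = np.zeros(n_states); R[0], R[-1] = 0.2, 1.0  # small distractor, big goal

def greedy_policy_after_k_sweeps(k):
    V = np.zeros(n_states)
    for _ in range(k):
        left = np.concatenate(([V[0]], V[:-1]))    # moving left (clamped)
        right = np.concatenate((V[1:], [V[-1]]))   # moving right (clamped)
        V = R + gamma * np.maximum(left, right)
    left = np.concatenate(([V[0]], V[:-1]))
    right = np.concatenate((V[1:], [V[-1]]))
    return (right >= left).astype(int)             # 1 = right, 0 = left

def budget_posterior(observed, budgets=(1, 2, 4, 8), eps=0.1):
    logps = []
    for k in budgets:
        pi = greedy_policy_after_k_sweeps(k)
        logps.append(sum(np.log(1 - eps if pi[s] == a else eps)
                         for s, a in observed))
    w = np.exp(np.array(logps) - max(logps))
    return dict(zip(budgets, w / w.sum()))

# actions that head for the distant reward imply a larger inference budget
print(budget_posterior([(0, 1), (1, 1), (2, 1)]))
```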
-
|
Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI
(
Poster
)
>
link
We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to plan and solve complex activities resembling everyday human household tasks. The Mini-BEHAVIOR environment extends the widely used MiniGrid grid world with new modes of actuation, combining navigation and manipulation actions, multiple objects, states, scenes, and activities defined in first-order logic. Mini-BEHAVIOR implements various household tasks from the original BEHAVIOR benchmark, along with starter code for data collection and reinforcement learning agent training. Together with Mini-BEHAVIOR, we also include a procedural generation mechanism to create countless variations of each task and support the study of plan generalization and open-ended learning. Mini-BEHAVIOR is fast and easy to use and extend, providing the benefits of rapid prototyping while striking a good balance between symbolic-level decision-making and the physical realism, complexity, and embodiment constraints found in complex embodied AI benchmarks. Our goal with Mini-BEHAVIOR is to provide the community with a fast, open-ended benchmark that is easy to use and modify, for developing and evaluating decision-making and generalizable planning solutions for embodied AI. Code is available at https://github.com/StanfordVL/mini_behavior. |
Emily Jin · Jiaheng Hu · Zhuoyi Huang · Ruohan Zhang · Jiajun Wu · Fei-Fei Li · Roberto Martín-Martín 🔗 |
-
|
Learning Safe Action Models with Partial Observability
(
Poster
)
>
link
A common approach for solving planning problems is to model them in a formal language such as the Planning Domain Definition Language (PDDL), and then use an appropriate PDDL planner. Several algorithms for learning PDDL models from observations have been proposed but plans created with these learned models may not be sound. We propose two algorithms for learning PDDL models that are guaranteed to be safe to use even when given observations that include partially observable states. We analyze these algorithms theoretically, characterizing the sample complexity each algorithm requires to guarantee probabilistic completeness. We also show experimentally that our algorithms are often better than FAMA, a state-of-the-art PDDL learning algorithm. |
Brendan Juba · Hai Le · Ron T Stern 🔗 |
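For intuition, here is a minimal sketch of the fully observable, STRIPS-style core of safe action-model learning; the paper's contribution is handling partially observable states, which this sketch does not attempt. Preconditions are kept maximally conservative by intersecting every state in which the action was observed, so the learned model never permits an action where it has not been justified.

```python
# Sketch: states are frozensets of true literals; each observed transition
# narrows the precondition and widens the add/delete effects.

def learn_safe_model(observations):
    """observations: list of (pre_state, action, post_state) triples."""
    model = {}
    for pre, act, post in observations:
        add, delete = post - pre, pre - post
        if act not in model:
            model[act] = {"pre": set(pre), "add": set(add), "del": set(delete)}
        else:
            model[act]["pre"] &= pre          # only keep always-true literals
            model[act]["add"] |= add
            model[act]["del"] |= delete
    return model

obs = [
    (frozenset({"door_open", "have_key"}), "enter", frozenset({"inside", "have_key"})),
    (frozenset({"door_open"}),             "enter", frozenset({"inside"})),
]
model = learn_safe_model(obs)
print(model["enter"]["pre"])   # {'door_open'} -- 'have_key' was not required
```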
-
|
Value Iteration with Value of Information Networks
(
Poster
)
>
link
Despite great success in recent years, deep reinforcement learning architectures still face a tremendous challenge in dealing with uncertainty and perceptual ambiguity. Similarly, networks that learn to build a world model from the input and perform model-based decision making in novel environments (e.g., value iteration networks) are mostly limited to fully observable tasks. In this paper, we propose a new planning module architecture, the VI$^2$N (Value Iteration with Value of Information Network), that learns to act in novel environments with a high amount of perceptual ambiguity. This architecture emphasizes reducing uncertainty before exploiting the reward. Our network outperforms other deep architectures in challenging partially observable environments. Moreover, it generates interpretable cognitive maps highlighting both rewarding and informative locations. The similarity of our network's principles and computations to observed cognitive processes and neural activity in the hippocampus draws a strong connection between VI$^2$N and principles of computation in biological networks.
|
Samantha Johnson · Michael Buice · Koosha Khalvati 🔗 |
-
|
Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models
(
Poster
)
>
link
If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner utilizes its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves real robot control tasks involving novel objects, distractors, and even environments, both in the real world and in simulation. The project website can be found at https://subgoal-image-editing.github.io |
Kevin Black · Mitsuhiko Nakamoto · Pranav Atreya · Homer Walke · Chelsea Finn · Aviral Kumar · Sergey Levine 🔗 |
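The division of labor described above can be sketched as a simple control loop. Both models below are stand-ins (the real system fine-tunes InstructPix2Pix and trains a goal-conditioned policy on robot data); only the replan-a-subgoal-every-K-steps structure is the point.

```python
import numpy as np

# Sketch of the high-level/low-level loop: a subgoal image is proposed from
# the current observation and language command, and a low-level policy
# tracks it. All three components are illustrative stand-ins.

rng = np.random.default_rng(0)
subgoal_model = lambda obs, command: obs + 0.1 * rng.normal(size=obs.shape)
goal_policy   = lambda obs, goal_img: np.clip(goal_img.mean() - obs.mean(), -1, 1)
env_step      = lambda obs, action: obs + 0.01 * action   # stand-in dynamics

def run_episode(obs, command, horizon=100, replan_every=20):
    subgoal = subgoal_model(obs, command)
    for t in range(horizon):
        if t % replan_every == 0 and t > 0:
            subgoal = subgoal_model(obs, command)   # propose next subgoal image
        action = goal_policy(obs, subgoal)          # low-level tracking
        obs = env_step(obs, action)
    return obs

final = run_episode(np.zeros((8, 8)), "put the spoon on the towel")
```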
-
|
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL
(
Poster
)
>
link
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning, and real-environment exploration using the current policy for dynamics model learning. However, due to complex real-world environments, it is inevitable that the learned dynamics model has prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods that addresses the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty when choosing actions during model rollouts and as a bonus during real-environment exploration. Consequently, $\texttt{COPlanner}$ can avoid model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it actively reduces model error by exploring high-reward, model-uncertain regions through optimistic real-environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods are significantly improved when combined with $\texttt{COPlanner}$.
|
Xiyao Wang · Ruijie Zheng · Yanchao Sun · ruonan jia · Wichayaporn Wongkamjan · Huazhe Xu · Furong Huang 🔗 |
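The penalty-versus-bonus asymmetry can be sketched compactly. The ensemble, reward function, and candidate-action scoring below are illustrative stand-ins, not the paper's UP-MPC; they show uncertainty (ensemble disagreement) subtracted from the score during model rollouts and added during real-environment exploration.

```python
import numpy as np

# Sketch: disagreement across a dynamics ensemble is the uncertainty
# estimate; its sign flips between conservative rollouts and optimistic
# exploration.

rng = np.random.default_rng(0)
ensemble = [rng.normal(size=(4, 4)) * 0.3 for _ in range(5)]  # stand-in models
reward = lambda s: -np.linalg.norm(s)

def score(state, action, horizon=3, lam=1.0, optimistic=False):
    total, s = 0.0, state
    for _ in range(horizon):
        preds = [M @ (s + action) for M in ensemble]
        disagreement = np.mean(np.std(preds, axis=0))   # uncertainty estimate
        s = np.mean(preds, axis=0)
        sign = +1.0 if optimistic else -1.0             # bonus when exploring,
        total += reward(s) + sign * lam * disagreement  # penalty when rolling out
    return total

state = rng.normal(size=4)
candidates = [rng.normal(size=4) for _ in range(16)]
a_rollout = max(candidates, key=lambda a: score(state, a, optimistic=False))
a_explore = max(candidates, key=lambda a: score(state, a, optimistic=True))
```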
-
|
General and Reusable Indexical Policies and Sketches
(
Poster
)
>
link
Recently, a simple but powerful language for expressing general policies and problem decompositions (sketches) has been introduced, based on collections of rules defined over a set of Boolean and numerical features. In this work, we consider extensions of this basic language aimed at making policies and sketches more flexible and reusable. For this, three basic extensions are considered: 1) internal memory states, as in finite state controllers; 2) indexical features, whose values are a function of the state and a number of internal registers that can be loaded with objects; and 3) modules that wrap up policies and sketches and allow them to call each other by passing parameters. The expressive power of the resulting language for policies and sketches that are general and reusable is illustrated through examples. |
Blai Bonet · Dominik Drexler · Hector Geffner 🔗 |
-
|
Improving Generalization in Reinforcement Learning Training Regimes for Social Robot Navigation
(
Poster
)
>
link
In order for autonomous mobile robots to navigate in human spaces, they must abide by our social norms. Reinforcement learning (RL) has emerged as an effective method to train robot sequential decision-making policies that are able to respect these norms. However, a large portion of existing work in the field conducts both RL training and testing in simplistic environments. This limits the generalization potential of these models to unseen environments, and undermines the meaningfulness of their reported results. We propose a method to improve the generalization performance of RL social navigation methods using curriculum learning. By employing multiple environment types and by modeling pedestrians using multiple dynamics models, we are able to progressively diversify and escalate difficulty in training. Our results show that curriculum learning in training can be used to achieve better generalization performance than previous training methods. We also show that many existing state-of-the-art RL social navigation works do not evaluate their methods outside of their training environments, and their reported results therefore fail to reveal their policies' inability to adequately generalize to out-of-distribution scenarios. In response, we validate our training approach on larger and more crowded testing environments than those used in training, allowing for more meaningful measurements of model performance. |
Adam Sigal · Hsiu-Chin Lin · AJung Moon 🔗 |
-
|
Conservative World Models
(
Poster
)
>
link
Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline pre-training phase. Forward-backward (FB) representations represent remarkable progress towards this ideal, achieving 85% of the performance of task-specific agents in this setting. However, such performance is contingent on access to large and diverse datasets for pre-training, which cannot be expected for most real problems. Here, we explore how FB performance degrades when trained on small datasets that lack diversity, and mitigate it with conservatism, a well-established feature of performant offline RL algorithms. We evaluate our family of methods across various datasets, domains and tasks, reaching 150% of vanilla FB performance in aggregate. Somewhat surprisingly, conservative FB algorithms also outperform the task-specific baseline, despite lacking access to reward labels and being required to maintain policies for all tasks. Conservative FB algorithms perform no worse than FB on full datasets, and so present little downside over their predecessor. Our code is available open-source via: https://enjeeneer.io/projects/conservative-world-models/. |
Scott Jeen · Tom Bewley · Jonathan Cullen 🔗 |
-
|
Targeted Uncertainty Reduction in Robust MDPs
(
Poster
)
>
link
Robust Markov decision processes (MDPs) provide a practical framework for generalizing trained agents to new environments. There, the objective is to maximize performance under the worst model of a given uncertainty set. By construction, this raises a performance-robustness dilemma: accounting for too large an uncertainty set yields guarantees against larger disturbances, whilst too small a set may result in over-sensitivity to model misspecification. In this work, we introduce an online method that addresses the conservativeness of robust MDPs by strategically contracting the uncertainty set. First, we explicitly formulate the gradient of the robust return with respect to the uncertainty radius. This gradient derivation enables us to prioritize efforts in reducing uncertainty and leads us to interesting findings on the relation between the robust return and the uncertainty set. Second, we present a sampling-based algorithm aimed at enhancing our uncertainty estimation with respect to the robust return. Third, we illustrate the effectiveness of our algorithm within a tabular environment. |
Uri Gadot · Kaixin Wang · Esther Derman · Navdeep Kumar · Kfir Y. Levy · Shie Mannor 🔗 |
-
|
Quantized Local Independence Discovery for Fine-Grained Causal Dynamics Learning in Reinforcement Learning
(
Poster
)
>
link
Incorporating causal relationships between the variables into dynamics learning has emerged as a promising approach to enhance robustness and generalization in reinforcement learning (RL). Recent studies have focused on examining conditional independences and leveraging only relevant state and action variables for prediction. However, such approaches tend to overlook local independence relationships that hold only under certain circumstances, referred to as events. In this work, we present a theoretically grounded and practical approach to dynamics learning which discovers such meaningful events and infers fine-grained causal relationships. The key idea is to learn, via vector quantization, a discrete latent variable that represents an event together with the causal relationships specific to that event. As a result, our method provides a fine-grained understanding of the dynamics by capturing event-specific causal relationships, leading to improved robustness and generalization in RL. Experimental results demonstrate that our method is more robust to unseen states and generalizes well to downstream tasks compared to prior approaches. In addition, we find that our method successfully identifies meaningful events and recovers event-specific causal relationships. |
Inwoo Hwang · Yun-hyeok Kwak · Suhyung Choi · Byoung-Tak Zhang · Sanghack Lee 🔗 |
-
|
Relating Goal and Environmental Complexity for Improved Task Transfer: Initial Results
(
Poster
)
>
link
The complexity of an environment and the difficulty of an actor's goals both impact transfer learning in Reinforcement Learning (RL). Yet, few works have examined using the environment and goals in tandem to generate a learning curriculum that improves transfer. To explore this relationship, we introduce a task graph that quantifies the environment complexity using environment descriptors and the goal difficulty using goal descriptors; edges in the task graph indicate a change in the environment or the goal. We use the task graph in two sets of studies. First, we evaluate the task graph in two synthetic environments where we control environment and goal complexity. Second, we introduce an algorithm that generates a Task-Graph Curriculum to train policies using the task graph. In a delivery environment with up to ten skills, we demonstrate that a planner can execute these trained policies to achieve long-horizon goals in increasingly complex environments. Our results demonstrate that (1) the task graph promotes skill transfer in the synthetic environments and (2) the Task-Graph Curriculum trains nearly perfect policies and does so significantly faster than learning a policy from scratch. |
Sunandita Patra · Paul Rademacher · Kristen Jacobson · Kyle Hassold · Onur Kulaksizoglu · Laura Hiatt · Mark Roberts · Dana Nau 🔗 |
-
|
Uncertainty-Aware Action Repeating Options
(
Poster
)
>
link
In reinforcement learning, employing temporal abstraction within the action space is a prevalent strategy for simplifying policy learning through temporally-extended actions. Recently, algorithms that repeat a primitive action for a certain number of steps, a simple method to implement temporal abstraction in practice, have demonstrated better performance than traditional algorithms. However, a significant drawback of earlier studies on action repetition is the potential for repeated sub-optimal actions to considerably degrade performance. To tackle this problem, we introduce a new algorithm that employs ensemble methods to estimate uncertainty when extending an action. Our framework offers flexibility, allowing policies to either prioritize exploration or adopt an uncertainty-averse stance based on their specific needs. We provide empirical results on various environments, highlighting the superior performance of our proposed method compared to other action-repeating algorithms. These results indicate that our uncertainty-aware strategy effectively counters the downsides of action repetition, enhancing policy learning efficiency. |
Joongkyu Lee · Seung Joon Park · Yunhao Tang · Min-hwan Oh 🔗 |
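A minimal sketch of the repeat-or-stop decision, with a stand-in tabular Q-ensemble: disagreement across ensemble members serves as the uncertainty estimate, and a flag switches between the uncertainty-averse and exploratory behaviors mentioned above.

```python
import numpy as np

# Sketch: extend (repeat) the current action only while the Q-ensemble
# agrees about its value. The ensemble here is a random table, not learned.

rng = np.random.default_rng(0)
n_states, n_actions, n_members = 10, 3, 5
Q_ensemble = rng.normal(size=(n_members, n_states, n_actions))

def should_repeat(state, action, threshold=0.5, uncertainty_averse=True):
    q_values = Q_ensemble[:, state, action]
    uncertainty = q_values.std()
    # an uncertainty-averse policy stops repeating when uncertainty is high;
    # an exploratory one instead keeps extending uncertain actions
    return uncertainty < threshold if uncertainty_averse else uncertainty >= threshold

state, action, repeats = 0, 1, 1
while should_repeat(state, action) and repeats < 8:
    repeats += 1        # in a real loop the environment would step here
    state = (state + 1) % n_states
```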
-
|
Robust Driving Across Scenarios via Multi-residual Task Learning
(
Poster
)
>
link
Conventional control, such as model-based control, is commonly utilized in autonomous driving due to its efficiency and reliability. However, real-world autonomous driving contends with a multitude of diverse traffic scenarios that are challenging for these planning algorithms. Model-free Deep Reinforcement Learning (DRL) presents a promising avenue in this direction, but learning DRL control policies that generalize to multiple traffic scenarios is still a challenge. To address this, we introduce Multi-residual Task Learning (MRTL), a generic learning framework based on multi-task learning that, for a set of task scenarios, decomposes the control into nominal components that are effectively solved by conventional control methods and residual terms which are solved using learning. We employ MRTL for fleet-level emission reduction in mixed traffic using autonomous vehicles as a means of system control. By analyzing the performance of MRTL across nearly 600 signalized intersections and 1200 traffic scenarios, we demonstrate that it emerges as a promising approach to synergize the strengths of DRL and conventional methods in generalizable control. |
Vindula Jayawardana · Sirui Li · Cathy Wu · Yashar Farid · Kentaro Oguchi 🔗 |
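The additive decomposition at the core of MRTL fits in a few lines. The nominal controller below is a toy car-following rule and the residual a stand-in linear map; both are assumptions for illustration, not the paper's controllers.

```python
import numpy as np

# Sketch of multi-residual control: action = nominal(s) + learned residual(s).
# The nominal term handles the bulk of the task; learning only corrects it.

def nominal_control(gap, speed, desired_gap=10.0, k=0.5):
    # conventional model-based controller: close the gap to the desired value
    return np.clip(k * (gap - desired_gap) - 0.1 * speed, -3.0, 3.0)

rng = np.random.default_rng(0)
W = rng.normal(size=2) * 0.01          # stand-in learned residual parameters

def residual_policy(obs):
    return float(W @ obs)              # learned correction, small by design

def act(obs):
    gap, speed = obs
    return nominal_control(gap, speed) + residual_policy(np.asarray(obs))

print(act((12.0, 1.0)))  # nominal term dominates; the residual refines it
```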
-
|
Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels
(
Poster
)
>
link
The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments with high-dimensional state spaces such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transition dynamics at an abstract level and training a world model on such transitions. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method supports not only building world models with longer horizons, but also planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain. |
Thomas Jiralerspong · Flemming Kondrup · Doina Precup · Khimya Khetarpal 🔗 |
-
|
A Study of Generalization in Offline Reinforcement Learning
(
Poster
)
>
link
Despite the recent progress in offline reinforcement learning (RL) algorithms, agents are usually trained and tested on the same environment. In this paper, we perform an in-depth study of the generalization abilities of offline RL algorithms, showing that they struggle to generalize to new environments. We also introduce the first benchmark for evaluating generalization in offline learning, collecting datasets of varying sizes and skill levels from Procgen (2D video games) and WebShop (e-commerce websites). The datasets contain trajectories for a limited number of game levels or natural language instructions, and at test time the agent has to generalize to new levels or instructions. Our experiments reveal that existing offline learning algorithms perform significantly worse than online RL on both train and test environments. Behavioral cloning is a strong baseline, typically outperforming offline RL and sequence modeling approaches when trained on data from multiple environments and tested on new ones. Finally, we find that increasing the diversity of the data, rather than its size, improves generalization for all algorithms. Our study demonstrates the limited generalization of current offline learning algorithms, highlighting the need for more research in this area. |
Ishita Mediratta · Qingfei You · Minqi Jiang · Roberta Raileanu 🔗 |