Workshop
Agent Learning in Open-Endedness Workshop
Minqi Jiang · Mikayel Samvelyan · Jack Parker-Holder · Mayalen Etcheverry · Yingchen Xu · Michael Dennis · Roberta Raileanu
Room 211 - 213
Open-ended learning (OEL) has received rapidly growing attention in recent years, as deep learning models become ever more adept at learning meaningful and useful behaviors from web-scale data. Improving the performance and generality of such models depends greatly on our ability to continue collecting new and useful training data. OEL systems co-evolve the learning agent (e.g. the model) with its environment or other sources of training data, resulting in the continued, active generation of new training data specifically useful to the current agent or model. Conceivably, such OEL processes, if designed appropriately, can lead to models exhibiting increasingly general capabilities. However, producing a truly open-ended system in practice, one that endlessly generates meaningfully novel data, remains an open problem. We hope our workshop provides a forum both for bridging knowledge across a diverse set of relevant fields and for sparking new insights that can enable truly open-ended learning systems.
Schedule
Fri 7:00 a.m. - 7:00 a.m.
|
Introductory remarks
(
Introductory remarks
)
>
SlidesLive Video |
Minqi Jiang · Mikayel Samvelyan 🔗 |
Fri 7:00 a.m. - 7:30 a.m.
|
Is Simulation Dead?
(
Invited talk
)
>
SlidesLive Video |
Tim Rocktäschel 🔗 |
Fri 7:30 a.m. - 8:00 a.m.
|
Lisa Soros
(
Invited talk
)
>
SlidesLive Video |
Lisa Soros 🔗 |
Fri 8:00 a.m. - 8:30 a.m.
|
Adaptive Machines: Unleashing the Power of Evolutionary Reinforcement Learning for Versatile and Resilient Robotics
(
Invited talk
)
>
SlidesLive Video |
Antoine Cully 🔗 |
Fri 8:30 a.m. - 9:00 a.m.
|
Amorphous Fortress: Exploring Emergent Behavior in Open-Ended Simulations
(
Invited talk
)
>
SlidesLive Video |
M Charity 🔗 |
Fri 9:00 a.m. - 9:15 a.m.
|
WebArena: A Realistic Web Environment for Building Autonomous Agents
(
Spotlight talk
)
>
SlidesLive Video |
Shuyan Zhou 🔗 |
Fri 9:15 a.m. - 9:30 a.m.
|
OMNI: Open-endedness via Models of human Notions of Interestingness
(
Spotlight talk
)
>
SlidesLive Video |
Jenny Zhang 🔗 |
Fri 9:30 a.m. - 9:45 a.m.
|
Voyager: An Open-Ended Embodied Agent with Large Language Models
(
Spotlight talk
)
>
SlidesLive Video |
Guanzhi Wang 🔗 |
Fri 10:45 a.m. - 11:45 a.m.
|
Poster session
(
Poster session
)
>
|
🔗 |
Fri 11:45 a.m. - 12:15 p.m.
|
Abstraction and Analogy are the Keys to Robust, Open-Ended AI
(
Invited talk
)
>
SlidesLive Video |
Melanie Mitchell 🔗 |
Fri 12:15 p.m. - 12:45 p.m.
|
Open-Ended and AI-Generating Algorithms in the Era of Foundation Models
(
Invited talk
)
>
SlidesLive Video |
Jeff Clune 🔗 |
Fri 12:45 p.m. - 1:15 p.m.
|
Algorithmic Scenario Generation as Quality Diversity Optimization
(
Invited talk
)
>
SlidesLive Video |
Stefanos Nikolaidis 🔗 |
Fri 1:15 p.m. - 1:45 p.m.
|
Feryal Behbahani
(
Invited talk
)
>
SlidesLive Video |
Feryal Behbahani 🔗 |
Fri 1:45 p.m. - 2:00 p.m.
|
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
(
Spotlight talk
)
>
SlidesLive Video |
Martin Klissarov · Pierluca D'Oro 🔗 |
Fri 2:00 p.m. - 2:15 p.m.
|
Eureka: Human-Level Reward Design via Coding Large Language Models
(
Spotlight talk
)
>
SlidesLive Video |
Jason Ma 🔗 |
Fri 2:15 p.m. - 2:30 p.m.
|
Quality Diversity through Human Feedback
(
Spotlight talk
)
>
SlidesLive Video |
Li Ding 🔗 |
Fri 2:30 p.m. -
|
Discussion Panel
(
Panel
)
>
SlidesLive Video |
Jeff Clune · Linxi Fan · M Charity · Antoine Cully · Stefanos Nikolaidis · Roberta Raileanu · Tim Rocktäschel 🔗 |
-
|
Noisy ZSC: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games
(
Poster
)
>
link
Zero-shot coordination (ZSC) is a popular setting for studying the ability of AI agents to coordinate with novel partners. Prior formulations of ZSC assume that the problem setting is common knowledge, i.e., each agent knows the underlying Dec-POMDP, every agent knows the others have this knowledge, and so on ad infinitum. However, in most real-world situations, different agents are likely to have different models of the (real world) environment, breaking this assumption. To address this limitation, we formulate the noisy zero-shot coordination (NZSC) problem, where agents observe different noisy versions of the ground truth Dec-POMDP, generated by passing the true Dec-POMDP through a noise model. Only the distribution over ground truth Dec-POMDPs and the noise model are common knowledge. We show that any NZSC problem can be reformulated as a ZSC problem by designing a meta-Dec-POMDP with an augmented state space consisting of both the ground truth Dec-POMDP and its corresponding state. In our experiments, we analyze various aspects of NZSC and show that achieving good performance in NZSC requires agents to make use of the noisy observations of the ground truth Dec-POMDP, knowledge of each other's noise models, and their interactions with the ground truth Dec-POMDP. Through experimental results, we further establish that ignoring the noise in the problem specification can result in sub-par coordination performance, especially in iterated scenarios. On the whole, our work highlights that NZSC adds an orthogonal challenge to traditional ZSC: tackling the uncertainty about the true problem. |
Usman Anwar · Jia Wan · David Krueger · Jakob Foerster 🔗 |
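As a rough illustration of the reduction described above, the sketch below pairs a hidden ground truth Dec-POMDP with its state to form the augmented meta-state, while each agent conditions only on its own noisy view. All class and function names are hypothetical, chosen for exposition rather than taken from the paper.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for the objects described in the abstract.
@dataclass(frozen=True)
class DecPOMDP:
    """Minimal placeholder: a Dec-POMDP identified by a parameter vector."""
    params: tuple

def sample_ground_truth(prior):
    """Draw the true Dec-POMDP from the (commonly known) prior."""
    return random.choice(prior)

def noise_model(true_env: DecPOMDP, sigma: float) -> DecPOMDP:
    """Each agent sees a perturbed copy of the true Dec-POMDP."""
    noisy = tuple(p + random.gauss(0.0, sigma) for p in true_env.params)
    return DecPOMDP(params=noisy)

# The NZSC-to-ZSC reduction: the meta-state pairs the (hidden) true
# Dec-POMDP with its current state; a standard ZSC method can then be
# trained on this meta problem, with agents seeing only noisy views.
prior = [DecPOMDP(params=(0.1, 0.9)), DecPOMDP(params=(0.5, 0.5))]
true_env = sample_ground_truth(prior)
agent_views = [noise_model(true_env, sigma=0.05) for _ in range(2)]
meta_state = (true_env, "initial_state")  # augmented state of the meta-Dec-POMDP
```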
-
|
Stackelberg Driver Model for Continual Policy Improvement in Scenario-Based Closed-Loop Autonomous Driving
(
Poster
)
>
link
The deployment of autonomous vehicles (AVs) has faced hurdles due to the dominance of rare but critical corner cases within the long-tail distribution of driving scenarios, which negatively affects their overall performance. To address this challenge, adversarial generation methods have emerged as a class of efficient approaches to synthesize safety-critical scenarios for AV testing. However, these generated scenarios are often underutilized for AV training, leaving the potential for continual AV policy improvement untapped, along with a deficiency in the closed-loop design needed to achieve it. Therefore, we tailor the Stackelberg Driver Model (SDM) to accurately characterize the hierarchical nature of vehicle interaction dynamics, facilitating iterative improvement by engaging background vehicles (BVs) and the AV in a sequential game-like interaction paradigm. With the AV acting as the leader and BVs as followers, this leader-follower modeling ensures that the AV consistently refines its policy, always taking into account the additional information that the BVs play the best response to challenge it. Extensive experiments show that our algorithm exhibits superior performance compared to several baselines, especially in higher-dimensional scenarios, leading to substantial advancements in AV capabilities while continually generating progressively challenging scenarios. |
Haoyi Niu · Qimao Chen · Yingyue Li · Jianming HU 🔗 |
-
|
Syllabus: Curriculum Learning Made Easy
(
Poster
)
>
link
Curriculum learning has been a quiet yet crucial component of many of the high-profile successes of reinforcement learning. Despite this, none of the major reinforcement learning libraries support curriculum learning or include curriculum learning algorithms. Curriculum learning methods can provide general and complementary improvements to RL algorithms, but they often require significant, complex changes to agent training code. We introduce Syllabus, a library for training RL agents with curriculum learning, as a solution to this problem. Syllabus provides a universal API for implementing curriculum learning algorithms, a collection of implementations of popular curriculum learning methods, and infrastructure for easily integrating them into existing distributed RL code. Syllabus provides a clean API for each of the complex components of these methods, dramatically simplifying the process for designing new algorithms or applying existing algorithms to new environments. Syllabus also manages the multiprocessing communication required for curriculum learning, alleviating one of the key practical challenges of using these algorithms. We hope Syllabus will improve the process of developing and applying curriculum learning algorithms, and encourage widespread adoption of curriculum learning. |
Ryan Sullivan 🔗 |
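Syllabus's actual API is not reproduced here; the sketch below only illustrates the general pattern such a library standardizes: a curriculum object that picks the next level and is updated from episode outcomes, kept separate from the agent's training code. All names are illustrative assumptions.

```python
import random

class LearningProgressCurriculum:
    """Toy curriculum: sample levels in proportion to recent absolute
    improvement in episodic return (a common learning-progress signal).
    Illustrative only; this is not Syllabus's real API."""
    def __init__(self, levels, eps=1e-3):
        self.levels = levels
        self.eps = eps  # keeps every level sampleable
        self.history = {lvl: (0.0, 0.0) for lvl in levels}  # (prev, latest)

    def sample(self):
        weights = [abs(l - p) + self.eps for p, l in self.history.values()]
        return random.choices(self.levels, weights=weights, k=1)[0]

    def update(self, level, episode_return):
        _, latest = self.history[level]
        self.history[level] = (latest, episode_return)

# Usage inside an ordinary training loop: the curriculum sits between the
# agent and the environment and never touches the learner's update code.
curriculum = LearningProgressCurriculum(levels=list(range(16)))
for episode in range(100):
    level = curriculum.sample()        # pick the next training level
    episode_return = random.random()   # placeholder for an actual rollout
    curriculum.update(level, episode_return)
```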
-
|
Rethinking Teacher-Student Curriculum Learning under the Cooperative Mechanics of Experience
(
Poster
)
>
link
Teacher-Student Curriculum Learning (TSCL) is a curriculum learning framework that draws inspiration from human cultural transmission and learning. It involves a teacher algorithm shaping the learning process of a learner algorithm by exposing it to controlled experiences. Despite its success, understanding the conditions under which TSCL is effective remains challenging. In this paper, we propose a data-centric perspective to analyze the underlying mechanics of the teacher-student interactions in TSCL. We leverage cooperative game theory to describe how the composition of the set of experiences presented by the teacher to the learner, as well as their order, influences the performance of the curricula found by TSCL approaches. To do so, we demonstrate that for every TSCL problem, there exists an equivalent cooperative game, and several key components of the TSCL framework can be reinterpreted using game-theoretic principles. Through experiments covering supervised learning, reinforcement learning, and classical games, we estimate the cooperative values of experiences and use value-proportional curriculum mechanisms to construct curricula, even in cases where TSCL struggles. The framework and experimental setup we present in this work represent a foundation that can be used for a deeper exploration of TSCL, shedding light on its underlying mechanisms and providing insights into its broader applicability in machine learning. |
Manfred Diaz · Liam Paull · Andrea Tacchetti 🔗 |
-
|
Multi-Agent Diagnostics for Robustness via Illuminated Diversity
(
Poster
)
>
link
In the rapidly advancing field of multi-agent systems, ensuring robustness in unfamiliar and adversarial settings is crucial, particularly for those systems deployed in real-world scenarios. Notwithstanding their outstanding performance in familiar environments, these systems often falter in new situations due to overfitting during the training phase. This is especially pronounced in settings where both cooperative and competitive behaviours are present, encapsulating a dual nature of overfitting and generalisation challenges. To address this issue, we present Multi-Agent Diagnostics for Robustness via Illuminated Diversity (MADRID), a novel approach for systematically generating diverse adversarial scenarios that expose strategic vulnerabilities in pre-trained multi-agent policies. Leveraging concepts from open-ended learning, MADRID navigates the vast space of adversarial settings, employing a target policy's regret to gauge the vulnerabilities of these settings. We evaluate the effectiveness of MADRID on the 11 vs 11 version of Google Research Football, one of the most complex environments for multi-agent reinforcement learning. Specifically, we employ MADRID for generating a diverse array of adversarial settings for TiZero, the state-of-the-art approach which "masters" the game through 45 days of training on a large-scale distributed infrastructure. Using MADRID, we expose key shortcomings in TiZero's tactical decision-making, underlining the crucial importance of rigorous evaluation in multi-agent systems. |
Mikayel Samvelyan · Davide Paglieri · Minqi Jiang · Jack Parker-Holder · Tim Rocktäschel 🔗 |
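A minimal sketch of regret-guided archive search in the spirit described above follows: keep the highest-regret scenario per behaviour cell and mutate elites to illuminate the space. The descriptor, mutation, and return functions are assumptions standing in for actual policy rollouts, not the paper's implementation.

```python
import random

def regret(scenario, target_return, reference_return):
    """Regret of the target policy on a scenario: the gap to a reference
    policy's return. High regret flags a strategic vulnerability."""
    return reference_return(scenario) - target_return(scenario)

def cell_of(scenario):
    """Map a scenario to a coarse behaviour descriptor (assumed 2-D here)."""
    return (round(scenario[0], 1), round(scenario[1], 1))

def regret_guided_search(seeds, target_return, reference_return, n_iters=1000):
    """MAP-Elites-style loop keeping the highest-regret scenario per cell."""
    archive = {}
    for s in seeds:
        archive[cell_of(s)] = (s, regret(s, target_return, reference_return))
    for _ in range(n_iters):
        parent, _ = random.choice(list(archive.values()))
        child = tuple(p + random.gauss(0.0, 0.05) for p in parent)  # mutate
        r = regret(child, target_return, reference_return)
        c = cell_of(child)
        if c not in archive or r > archive[c][1]:
            archive[c] = (child, r)  # keep diverse, high-regret settings
    return archive

# Toy usage with synthetic return functions standing in for policy rollouts.
archive = regret_guided_search(
    seeds=[(0.5, 0.5)],
    target_return=lambda s: -abs(s[0] - 0.3),
    reference_return=lambda s: 0.0,
)
```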
-
|
JARVIS-1: Open-Ended Multi-task Agents with Memory-Augmented Multimodal Language Models
(
Poster
)
>
link
We propose JARVIS-1, a multi-task agent designed for the complex environment of Minecraft, which marks a significant advancement in achieving human-like planning within an open-world setting. By leveraging pre-trained Vision-Language Models, JARVIS-1 not only effectively interprets multimodal inputs but also adeptly translates them into actions. Its integration of a multimodal memory, which draws from both ingrained knowledge and real-time game experiences, enhances its decision-making capabilities. Its prowess is evident in its impressive performance across a wide array of tasks in Minecraft. Notably, its achievement on the long-horizon diamond pickaxe task, where it achieved a completion rate up to 5 times that of VPT, underscores its potential and the strides made in this domain. This breakthrough sets the stage for the future of more versatile and adaptable agents in complex virtual environments. |
Zihao Wang · Shaofei Cai · Anji Liu · Xiaojian (Shawn) Ma · Yitao Liang 🔗 |
-
|
minimax: Efficient Baselines for Autocurricula in JAX
(
Poster
)
>
link
Unsupervised environment design (UED) is a form of automatic curriculum learning for training robust decision-making agents in zero-shot transfer to unseen environments. Such autocurricula methods have gathered much interest from the RL community. However, UED experiments often require up to several weeks of training on standard RL architectures based on CPU rollouts and GPU model updates. This compute requirement is a major obstacle that prevents rapid innovation of UED methods. This work introduces minimax, a library that conducts UED training completely on accelerated hardware. To achieve this feat, minimax takes advantage of the JAX library and ports previous UED environments into fully-tensorized implementations, allowing the entire training loop to be compiled for hardware acceleration. As a petri dish for rapid experimentation, minimax includes a vectorized version of a grid-world based on MiniGrid, in addition to many reusable abstractions for conducting autocurricula in underspecified, procedurally generated environments. With these components, minimax provides strong UED baselines, including new parallelized variants, that achieve over 50$\times$ speedups in wall time compared to previous implementations.
|
Minqi Jiang · Michael Dennis · Edward Grefenstette · Tim Rocktäschel 🔗 |
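To make the "entirely on accelerated hardware" idea concrete, here is a minimal, generic JAX pattern (not minimax's actual code): a tensorized environment step folded into `jax.lax.scan` so the whole rollout compiles to a single XLA program, then batched across environments with `vmap`.

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Toy fully-tensorized environment step (stand-in for a ported env)."""
    new_state = state + action
    reward = -jnp.abs(new_state).sum()
    return new_state, reward

@jax.jit
def rollout(initial_state, actions):
    """The entire rollout compiles to one XLA program via lax.scan."""
    def step(carry, action):
        state = carry
        new_state, reward = env_step(state, action)
        return new_state, reward
    final_state, rewards = jax.lax.scan(step, initial_state, actions)
    return final_state, rewards

# Vectorize across a batch of environments with vmap, then run on accelerator.
batched_rollout = jax.vmap(rollout)
states = jnp.zeros((128, 2))              # 128 parallel envs
actions = jnp.ones((128, 100, 2)) * 0.01  # 100 steps each
final_states, rewards = batched_rollout(states, actions)
```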
-
|
ACES: generating diverse programming puzzles with autotelic language models and semantic descriptors
(
Poster
)
>
link
Finding and selecting new and interesting problems to solve is at the heart of curiosity, science and innovation. We here study automated problem generation in the context of the open-ended space of python programming puzzles. Existing generative models often aim at modeling a reference distribution without any explicit diversity optimization. Other methods explicitly optimizing for diversity do so either in limited hand-coded representation spaces or in uninterpretable learned embedding spaces that may not align with human perceptions of interesting variations. With ACES (Autotelic Code Exploration via Semantic descriptors), we introduce a new autotelic generation method that leverages semantic descriptors produced by a large language model (LLM) to directly optimize for interesting diversity, as well as few-shot-based generation. Each puzzle is labeled along 10 dimensions, each capturing a programming skill required to solve it. ACES generates and pursues novel and feasible goals to explore that abstract semantic space, slowly discovering a diversity of solvable programming puzzles in any given run. Across a set of experiments, we show that ACES discovers a richer diversity of puzzles than existing diversity-maximizing algorithms as measured across a range of diversity metrics. We further study whether and in which conditions this diversity can translate into the successful training of puzzle solving models. |
Julien Pourcel · Cédric Colas · Pierre-Yves Oudeyer · Laetitia Teodorescu 🔗 |
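The archive-plus-goal loop described above can be sketched in a few lines. The LLM labeler and generator below are trivial stand-ins so the sketch runs; in ACES these are calls to a large language model, and the descriptor is its semantic judgment of required skills.

```python
import random

N_SKILLS = 10  # each puzzle is labeled along 10 programming-skill dimensions

def llm_label(puzzle: str) -> tuple:
    """Stand-in for an LLM labeler mapping a puzzle to a 10-bit skill
    descriptor. In ACES this is a semantic judgment by an LLM."""
    rng = random.Random(puzzle)
    return tuple(rng.randint(0, 1) for _ in range(N_SKILLS))

def generate(goal: tuple, examples: list) -> str:
    """Stand-in for few-shot LLM generation toward a target descriptor."""
    return f"puzzle targeting skills {goal}, seeded by {len(examples)} examples"

def aces_loop(seed_puzzles, n_iters=50):
    archive = {}  # descriptor -> puzzles exhibiting that skill combination
    for p in seed_puzzles:
        archive.setdefault(llm_label(p), []).append(p)
    for _ in range(n_iters):
        # Autotelic step: set a semantic goal, then try to reach it.
        goal = tuple(random.randint(0, 1) for _ in range(N_SKILLS))
        examples = random.choice(list(archive.values()))
        candidate = generate(goal, examples)  # would be LLM-generated, verified
        archive.setdefault(llm_label(candidate), []).append(candidate)
    return archive

archive = aces_loop(["def f(x): return x + 1"])
print(f"{len(archive)} distinct skill combinations discovered")
```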
-
|
On the importance of data collection for training general goal-reaching policies.
(
Poster
)
>
link
Recent advances in ML suggest that the quantity of data available to a model is one of the primary bottlenecks to high performance. Although for language-based tasks there exist almost unlimited amounts of reasonably coherent data to train from, this is generally not the case for Reinforcement Learning, especially when dealing with a novel environment. In effect, even a relatively trivial continuous environment has an almost limitless number of states, but simply sampling random states and actions will likely not provide transitions that are interesting or useful for any potential downstream task. How should one generate massive amounts of useful data given only an MDP with no indication of downstream tasks? Are the quantity and quality of data truly transformative to the performance of a general controller? We propose to answer both of these questions. First, we introduce a principled unsupervised exploration method, ChronoGEM, which aims to achieve uniform coverage over the manifold of achievable states, which we believe is the most reasonable goal given no prior task information. Second, we investigate the effects of both data quantity and data quality on the training of a downstream goal-achievement policy, and show that both large quantities and high quality of data are essential to train a general controller: a high-precision pose-achievement policy capable of attaining a large number of poses over numerous continuous control embodiments, including humanoid. |
Alexis Jacq · Manu Orsini · Gabriel Dulac-Arnold · Olivier Pietquin · Matthieu Geist · Olivier Bachem 🔗 |
-
|
LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers
(
Poster
)
>
link
We propose a framework that leverages foundation models as teachers, guiding a reinforcement learning agent to acquire semantically meaningful behavior without human intervention. In our framework, the agent receives task instructions grounded in a training environment from large language models. Then, a vision-language model guides the agent in learning the tasks by providing reward feedback. We demonstrate that our method can learn semantically meaningful skills in the challenging open-ended MineDojo environment, where prior unsupervised skill discovery methods struggle. Additionally, we discuss the observed challenges of using off-the-shelf foundation models as teachers and our efforts to address them. |
Taewook Nam · Juyong Lee · Jesse Zhang · Sung Ju Hwang · Joseph Lim · Karl Pertsch 🔗 |
-
|
WebArena: A Realistic Web Environment for Building Autonomous Agents
(
Spotlight
)
>
link
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent achieves an end-to-end task success rate of only 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, show that current state-of-the-art large language models are far from perfect performance on these real-life tasks, and demonstrate that WebArena can be used to measure such progress. |
Shuyan Zhou · Frank F. Xu · Hao Zhu · Xuhui Zhou · Robert Lo · Abishek Sridhar · Xianyi Cheng · Tianyue Ou · Yonatan Bisk · Daniel Fried · Uri Alon · Graham Neubig
|
-
|
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
(
Spotlight
)
>
link
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt. |
Martin Klissarov · Pierluca D'Oro · Shagun Sodhani · Roberta Raileanu · Pierre-Luc Bacon · Pascal Vincent · Amy Zhang · Mikael Henaff 🔗 |
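The core of the recipe, eliciting pairwise preferences from an LLM over event captions and distilling them into a scalar intrinsic reward, can be sketched as a standard Bradley-Terry reward model. Dimensions, architecture, and names below are assumptions for illustration; Motif's actual setup differs in its details.

```python
import torch
import torch.nn as nn

class CaptionRewardModel(nn.Module):
    """Maps an embedded event caption to a scalar intrinsic reward."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, caption_embedding):
        return self.net(caption_embedding).squeeze(-1)

def preference_loss(model, emb_a, emb_b, llm_prefers_a):
    """Bradley-Terry style loss on LLM-annotated caption pairs:
    P(a preferred over b) = sigmoid(r(a) - r(b))."""
    logits = model(emb_a) - model(emb_b)
    return nn.functional.binary_cross_entropy_with_logits(
        logits, llm_prefers_a.float()
    )

# Toy training step on random data standing in for annotated caption pairs.
model = CaptionRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
emb_a, emb_b = torch.randn(32, 256), torch.randn(32, 256)
labels = torch.randint(0, 2, (32,))  # 1 where the LLM preferred caption a
loss = preference_loss(model, emb_a, emb_b, labels)
opt.zero_grad()
loss.backward()
opt.step()
# The trained model(embedding) then serves as the intrinsic reward for RL.
```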
-
|
DOGE: Domain Reweighting with Generalization Estimation
(
Poster
)
>
link
The coverage and composition of the pretraining data corpus significantly impacts the generalization ability of large language models. Conventionally, the pretraining corpus is composed of various source domains (e.g. CommonCrawl, Wikipedia, Github etc.) according to certain sampling probabilities (domain weights). However, current methods lack a principled way to optimize domain weights for the ultimate goal of generalization. We propose DOmain reweighting with Generalization Estimation (DoGE), where we reweight the sampling probability of each domain based on its contribution to the final generalization objective, assessed by a gradient-based generalization estimation function. First, we train a small-scale proxy model with a min-max optimization to obtain the reweighted domain weights. At each step, the domain weights are updated to maximize the overall generalization gain by mirror descent. Finally we use the obtained domain weights to train a larger scale full-size language model. On the SlimPajama-6B dataset, with a universal generalization objective, DoGE achieves better average perplexity and zero-shot reasoning accuracy. On out-of-domain generalization tasks, DoGE reduces perplexity on the target domain by a large margin. We further apply a parameter-selection scheme which improves the efficiency of generalization estimation. |
Simin Fan · Matteo Pagliardini · Martin Jaggi 🔗 |
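The mirror-descent step on domain weights corresponds to an exponentiated-gradient update on the probability simplex. A minimal sketch follows; the gain values are placeholders for the gradient-based generalization estimates the abstract describes.

```python
import numpy as np

def mirror_descent_update(weights, generalization_gains, lr=1.0):
    """Mirror descent on the probability simplex (exponentiated gradient):
    domains whose gradients align with the generalization objective are
    upweighted, then weights are renormalized."""
    logits = np.log(weights) + lr * generalization_gains
    new_weights = np.exp(logits - logits.max())  # subtract max for stability
    return new_weights / new_weights.sum()

# Example: 4 domains; in DoGE the gains would come from alignment between
# per-domain gradients and the generalization objective's gradient.
domains = ["CommonCrawl", "Wikipedia", "GitHub", "Books"]
w = np.full(4, 0.25)
gains = np.array([0.1, 0.4, -0.2, 0.05])  # placeholder estimates
w = mirror_descent_update(w, gains)
print(dict(zip(domains, np.round(w, 3))))
```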
-
|
Voyager: An Open-Ended Embodied Agent with Large Language Models
(
Spotlight
)
>
link
SlidesLive Video
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. |
Guanzhi Wang · Yuqi Xie · Yunfan Jiang · Ajay Mandlekar · Chaowei Xiao · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
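The ever-growing skill library can be illustrated as a small store of executable snippets retrieved by description-embedding similarity. The embedding function below is a random stand-in so the sketch runs, and all names are illustrative rather than Voyager's actual code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a text-embedding model used to index skill descriptions."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

class SkillLibrary:
    """Minimal sketch: store executable code keyed by a description
    embedding, retrieve the top-k skills by cosine similarity."""
    def __init__(self):
        self.skills = []  # (description, code, embedding)

    def add(self, description: str, code: str):
        self.skills.append((description, code, embed(description)))

    def retrieve(self, task: str, k: int = 3):
        q = embed(task)
        scored = sorted(self.skills, key=lambda s: -float(s[2] @ q))
        return [(d, c) for d, c, _ in scored[:k]]

library = SkillLibrary()
library.add("mine wood", "def mine_wood(bot): ...")
library.add("craft pickaxe", "def craft_pickaxe(bot): ...")
# Retrieved skills would be placed into the LLM prompt for the next task.
print(library.retrieve("get a wooden pickaxe", k=2))
```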
-
|
Curriculum Learning for Cooperation in Multi-Agent Reinforcement Learning
(
Poster
)
>
link
While there has been significant progress in curriculum learning and continuous learning for training agents to generalize across a wide variety of environments in the context of single-agent reinforcement learning, it is unclear if these algorithms would still be valid in a multi-agent setting. In a competitive setting, a learning agent can be trained by making it compete with a curriculum of increasingly skilled opponents. However, a general intelligent agent should also be able to learn to act around other agents and cooperate with them to achieve common goals. When cooperating with other agents, the learning agent must (a) learn how to perform the task (or subtask), and (b) increase the overall team reward. In this paper, we aim to answer the question of what kind of cooperative teammate, and what curriculum of teammates, a learning agent should be trained with to achieve these two objectives. Our results on the game Overcooked show that a pre-trained teammate who is less skilled is the best teammate for overall team reward but the worst for the learning of the agent. Moreover, somewhat surprisingly, a curriculum of teammates with decreasing skill levels performs better than other types of curricula. |
Rupali Bhati · Vijaya Sai Krishna Gottipati · Clodéric Mars · Matthew Taylor 🔗 |
-
|
Continual Driving Policy Optimization with Closed-Loop Individualized Curricula
(
Poster
)
>
link
The safety of autonomous vehicles (AVs) has been a long-standing top concern, stemming from the absence of rare and safety-critical scenarios in the long-tail naturalistic driving distribution. To tackle this challenge, a surge of research in scenario-based autonomous driving has emerged, with a focus on generating high-risk driving scenarios and applying them to conduct safety-critical testing of AV models. However, little work has explored reusing these extensive scenarios to iteratively improve AV models. Moreover, it remains intractable and challenging to filter through gigantic scenario libraries collected from other AV models with distinct behaviors in an attempt to extract transferable information for improving the current AV. Therefore, we develop a continual driving policy optimization framework featuring Closed-Loop Individualized Curricula (CLIC), which we factorize into a set of standardized sub-modules for flexible implementation choices: AV Evaluation, Scenario Selection, and AV Training. CLIC frames AV Evaluation as a collision prediction task, where it estimates the chance of AV failures in these scenarios at each iteration. Subsequently, by re-sampling from historical scenarios based on these failure probabilities, CLIC tailors individualized curricula for downstream training, aligning them with the evaluated capability of the AV. Accordingly, CLIC not only maximizes the utilization of the vast pre-collected scenario library for closed-loop driving policy optimization but also facilitates AV improvement by individualizing its training with more challenging cases drawn from those poorly organized scenarios. Experimental results clearly indicate that CLIC surpasses other curriculum-based training strategies, showing substantial improvement in managing risky scenarios, while still maintaining proficiency in handling simpler cases. |
Haoyi Niu · Yizhou Xu · Xingjian Jiang · Jianming HU 🔗 |
-
|
Emergence of collective open-ended exploration from Decentralized Meta-Reinforcement learning
(
Poster
)
>
link
Recent works have proven that intricate cooperative behaviors can emerge in agents trained using meta-reinforcement learning on open-ended task distributions via self-play. While the results are impressive, we argue that self-play and other centralized training techniques do not accurately reflect how general collective exploration strategies emerge in the natural world: through decentralized training and over an open-ended distribution of tasks. In this work we therefore investigate the emergence of collective exploration strategies, where several agents meta-learn independent recurrent policies on an open-ended distribution of tasks. To this end we introduce a novel environment with an open-ended, procedurally generated task space which dynamically combines multiple subtasks sampled from five diverse task types to form a vast distribution of task trees. We show that decentralized agents trained in our environment exhibit strong generalization abilities when confronted with novel objects at test time. Additionally, despite never being forced to cooperate during training, the agents learn collective exploration strategies which allow them to solve novel tasks never encountered during training. We further find that the agents' learned collective exploration strategies extend to an open-ended task setting, allowing them to solve task trees of twice the depth compared to the ones seen during training. Our open source code, as well as videos of the agents, can be found on our companion website: https://sites.google.com/view/collective-open-ended-explore |
Richard Bornemann · Gautier Hamon · Eleni Nisioti · Clément Moulin-Frier 🔗 |
-
|
Does behavioral diversity in intrinsic rewards help exploration?
(
Poster
)
>
link
In recent years, intrinsic reward approaches have attracted the attention of the research community due to their ability to address various challenges in Reinforcement Learning, among which are exploration and diversity. Nevertheless, the two areas of study have seldom met. Many intrinsic rewards have been proposed to address the hard exploration problem by reducing the uncertainty of states/environment. Other intrinsic rewards were proposed to favor the agent's behavioral diversity, providing benefits of robustness, fast adaptation, and solving hierarchical tasks. We aim to investigate whether pushing for behavioral diversity can also be a way to favor exploration in sparse reward environments. The goal of this paper is to reinterpret the intrinsic reward approaches proposed in the literature, providing a new taxonomy based on the diversity level they impose on the exploration behavior, and complement it with an empirical study. Specifically, we define two main categories of exploration: "Where to explore" and "How to explore". The former favors exploration by imposing diversity on the states or state transitions (state and state + dynamics levels). The latter ("How to explore") rather pushes the agent to discover diverse policies that can elicit diverse behaviors (policy and skill levels). In the literature, it is unclear how the second category behaves compared to the first category. Thus, we conduct an initial study on the MiniGrid environment to compare the impact of selected intrinsic rewards imposing different diversity levels on a variety of tasks. |
Aya Kayal · Eduardo Pignatelli · Laura Toni 🔗 |
-
|
OMNI: Open-endedness via Models of human Notions of Interestingness
(
Spotlight
)
>
link
SlidesLive Video
Open-ended algorithms aim to learn new, interesting behaviors forever. That requires a vast environment search space, but such a space contains infinitely many possible tasks. Even after filtering for tasks the current agent can learn (i.e., learning progress), countless learnable yet uninteresting tasks remain (e.g., minor variations of previously learned tasks). An Achilles' heel of open-endedness research is the inability to quantify (and thus prioritize) tasks that are not just learnable, but also $\textit{interesting}$ (e.g., worthwhile and novel). We propose solving this problem by $\textit{Open-endedness via Models of human Notions of Interestingness}$ (OMNI). The insight is that we can utilize large (language) models (LMs) as a model of interestingness (MoI), because they $\textit{already}$ internalize human concepts of interestingness from training on vast amounts of human-generated data, where humans naturally write about what they find interesting or boring. We show that LM-based MoIs improve open-ended learning by focusing on tasks that are both learnable $\textit{and interesting}$, outperforming baselines based on uniform task sampling or learning progress alone. This approach has the potential to dramatically advance the ability to intelligently select which tasks to focus on next (i.e., auto-curricula), and could be seen as AI selecting its own next task to learn, facilitating self-improving AI and AI-Generating Algorithms.
|
Jenny Zhang · Joel Lehman · Kenneth Stanley · Jeff Clune 🔗 |
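The task-selection logic amounts to composing two filters, learnability (learning progress) and LM-judged interestingness. A toy sketch follows; the LM call is replaced by a trivial placeholder, and all names are illustrative.

```python
import random

def learning_progress(task, history):
    """Absolute change in success rate: a standard learnability proxy."""
    prev, latest = history.get(task, (0.0, 0.0))
    return abs(latest - prev)

def lm_is_interesting(task, learned_tasks) -> bool:
    """Stand-in for the Model of Interestingness: in OMNI this would be a
    prompt asking an LLM whether `task` is worthwhile and novel given what
    the agent has already mastered. Here, a trivial placeholder."""
    return task not in learned_tasks

def sample_next_task(candidates, history, learned_tasks):
    # 1) Keep tasks showing nonzero learning progress (learnable).
    learnable = [t for t in candidates if learning_progress(t, history) > 0.01]
    # 2) Among those, keep tasks the MoI deems interesting.
    interesting = [t for t in learnable if lm_is_interesting(t, learned_tasks)]
    pool = interesting or learnable or candidates
    return random.choice(pool)

history = {"craft_table": (0.2, 0.5), "noop_variant": (0.9, 0.9)}
task = sample_next_task(["craft_table", "noop_variant"], history, set())
print(task)  # "craft_table": learnable and not yet mastered
```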
-
|
Objectives Are All You Need: Solving Deceptive Problems Without Explicit Diversity Maintenance
(
Poster
)
>
link
Navigating deceptive domains has often been a challenge in machine learning due to search algorithms getting stuck at sub-optimal local optima. Many algorithms have been proposed to navigate these domains by explicitly maintaining diversity or equivalently promoting exploration, such as Novelty Search or other so-called Quality Diversity algorithms. In this paper, we present an approach that promises to solve deceptive domains without explicit diversity maintenance by optimizing a potentially large set of defined objectives. These objectives can be extracted directly from the environment by sub-aggregating the raw performance of individuals in a variety of ways. We use lexicase selection to optimize for these objectives as it has been shown to implicitly maintain population diversity. We compare this technique with a varying number of objectives to a commonly used quality diversity algorithm, MAP-Elites, on a set of discrete optimization as well as reinforcement learning domains with varying degrees of deception. We find that decomposing the raw performance into many objectives and optimizing them outperforms MAP-Elites on the deceptive domains that we explore. Furthermore, we find that this technique results in competitive performance on the diversity-focused metrics of QD-Score and Coverage, without explicitly optimizing for these things. Our ablation study shows that this technique is robust to different subaggregation techniques. However, when it comes to non-deceptive, or "illumination" domains, quality diversity techniques generally outperform our objective-based framework with respect to exploration (but not exploitation), hinting at potential directions for future work. |
Ryan Boldi · Li Ding · Lee Spector 🔗 |
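Lexicase selection itself is simple to state: shuffle the objectives, then repeatedly keep only the individuals that are elite on the next objective in the ordering. Below is a self-contained sketch of the standard algorithm (not the paper's code).

```python
import random

def lexicase_select(population, objective_scores):
    """Lexicase selection: filter the population through the objectives in a
    random order, keeping only individuals elite on each objective in turn.
    `objective_scores[i][j]` is individual i's score on objective j
    (higher is better)."""
    n_objectives = len(objective_scores[0])
    candidates = list(range(len(population)))
    for obj in random.sample(range(n_objectives), n_objectives):
        best = max(objective_scores[i][obj] for i in candidates)
        candidates = [i for i in candidates if objective_scores[i][obj] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]

# Toy usage: 4 individuals scored on 3 sub-aggregated objectives.
population = ["a", "b", "c", "d"]
scores = [[3, 1, 2], [3, 2, 0], [1, 3, 3], [2, 2, 2]]
parents = [lexicase_select(population, scores) for _ in range(5)]
print(parents)  # different objective orderings favor different specialists
```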
-
|
SmartPlay : A Benchmark for LLMs as Intelligent Agents
(
Poster
)
>
link
Recent large language models (LLMs) have demonstrated great potential toward intelligent agents and next-gen automation, but there is currently no systematic benchmark for evaluating LLMs' abilities as agents. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, and Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the sets of capabilities each game tests allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/LLMsmartplay/SmartPlay. |
Yue Wu · Xuan Tang · Tom Mitchell · Yuanzhi Li 🔗 |
-
|
Learning to Act without Actions
(
Poster
)
>
link
Pre-training large models on vast amounts of web data has proven to be an effective approach for obtaining powerful, general models in several domains, including language and vision. However, this paradigm has not yet taken hold in deep reinforcement learning (RL). This gap is due to the fact that the most abundant form of embodied behavioral data on the web consists of videos, which do not include the action labels required by existing methods for training policies from offline data. We introduce Latent Action Policies from Observation (LAPO), a method to infer latent actions and, consequently, latent-action policies purely from action-free demonstrations. Our experiments on challenging procedurally-generated environments show that LAPO can act as an effective pre-training method to obtain RL policies that can then be rapidly fine-tuned to expert-level performance. Our approach serves as a key stepping stone to enabling the pre-training of powerful, generalist RL models on the vast amounts of action-free demonstrations readily available on the web. |
Dominik Schmidt · Minqi Jiang 🔗 |
-
|
Adaptive Coalition Structure Generation
(
Poster
)
>
link
SlidesLive Video
We introduce a Deep Reinforcement Learning (DRL) framework to form socially-optimal coalitions in an adaptive manner. In our approach, agents play a deal-or-no-deal game where each state represents a potential coalition to join. Agents learn to form coalitions that are mutually beneficial, without revealing the coalition value to each other. We conducted an empirical evaluation of our model's generalizability on a ridesharing spatial game. |
Lucia Cipolina Kun · Ignacio Carlucho · Kalesha Bullard 🔗 |
-
|
MCU: A Task-centric Framework for Open-ended Agent Evaluation in Minecraft
(
Poster
)
>
link
To pursue the goal of creating an open-ended agent in Minecraft, an open-ended game environment with unlimited possibilities, this paper introduces a novel task-centric framework named MCU for Minecraft agent evaluation. The MCU framework leverages the concept of atom tasks as fundamental building blocks, enabling the generation of diverse or even arbitrary tasks. Within the MCU framework, each task is measured with 6 distinct difficulty scores (time consumption, operational effort, planning complexity, intricacy, creativity, novelty). These scores offer a multi-dimensional assessment of a task from different angles, and thus can reveal an agent's capability on specific facets. The difficulty scores also serve as the features of each task, which creates a meaningful task space and unveils the relationships between tasks. For practical evaluation of Minecraft agents employing the MCU framework, we maintain two custom benchmarks, comprising tasks meticulously designed to evaluate the agents' proficiency in high-level planning and low-level control, respectively. We show that MCU has the high expressivity to cover all tasks used in recent literature on Minecraft agents, and underscores the need for advancements in areas such as creativity, precise control, and out-of-distribution generalization toward the goal of open-ended Minecraft agent development. |
Haowei Lin · Zihao Wang · Jianzhu Ma · Yitao Liang 🔗 |
-
|
Exploration with Principles for Diverse AI Supervision
(
Poster
)
>
link
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from the principles of unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing large language models to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision. |
Hao Liu · Matei A Zaharia · Pieter Abbeel 🔗 |
-
|
Toward Open-ended Embodied Tasks Solving
(
Poster
)
>
link
SlidesLive Video
Empowering embodied agents, such as robots, with Artificial Intelligence (AI) has become increasingly important in recent years. A major challenge is task open-endedness. In practice, robots often need to perform tasks with novel goals that are multifaceted, dynamic, lack a definitive "end-state", and were not encountered during training. To tackle this problem, this paper introduces Diffusion for Open-ended Goals (DOG), a novel framework designed to enable embodied AI to plan and act flexibly and dynamically for open-ended task goals. DOG synergizes the generative prowess of diffusion models with state-of-the-art, training-free guidance techniques to adaptively perform online planning and control. Our evaluations demonstrate that DOG can handle various kinds of novel task goals not seen during training, in both maze navigation and robot control problems. Our work sheds light on enhancing embodied AI's adaptability and competency in tackling open-ended goals. |
Wei Wang · Dongqi Han · Xufang Luo · Yifei Shen · Charles Ling · Boyu Wang · Dongsheng Li 🔗 |
-
|
Mastering Memory Tasks with World Models
(
Poster
)
>
link
Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solve tasks involving extended time gaps between actions and outcomes, or tasks demanding the recalling of distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I establishes a new state-of-the-art performance in challenging memory and credit assignment RL tasks, such as Memory Maze, BSuite, and POPGym. At the same time, it upholds comparable performance in classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence. |
Mohammad Reza Samsami · Artem Zholus · Janarthanan Rajendran · Sarath Chandar 🔗 |
-
|
Mini-BEHAVIOR: A Procedurally Generated Benchmark for Long-horizon Decision-Making in Embodied AI
(
Poster
)
>
link
We present Mini-BEHAVIOR, a novel benchmark for embodied AI that challenges agents to use reasoning and decision-making skills to solve complex activities that resemble everyday human challenges. The Mini-BEHAVIOR environment is a fast, realistic Gridworld environment that offers the benefits of rapid prototyping and ease of use while preserving a symbolic level of physical realism and complexity found in complex embodied AI benchmarks. We introduce key features such as procedural generation, to enable the creation of countless task variations and support open-ended learning. Mini-BEHAVIOR provides implementations of various household tasks from the original BEHAVIOR benchmark, along with starter code for data collection and reinforcement learning agent training. In essence, Mini-BEHAVIOR offers a fast, open-ended benchmark for evaluating decision-making and planning solutions in embodied AI. It serves as a user-friendly entry point for research and facilitates the evaluation and development of solutions, simplifying their assessment and development while advancing the field of embodied AI. Code is available at https://anonymous.4open.science/r/mini_behavior-FEB4. |
Emily Jin · Jiaheng Hu · Zhuoyi Huang · Ruohan Zhang · Jiajun Wu · Fei-Fei Li · Roberto Martín-Martín 🔗 |
-
|
Quality-Diversity through AI Feedback
(
Poster
)
>
link
SlidesLive Video
In many text-generation problems, users may prefer not only a single response, but a diverse range of high-quality outputs from which to choose. Quality-diversity (QD) search algorithms aim at such outcomes, by continually improving and diversifying a population of candidates. However, the applicability of QD to qualitative domains, like creative writing, has been limited by the difficulty of algorithmically specifying measures of quality and diversity. Interestingly, recent developments in language models (LMs) have enabled guiding search through AI feedback, wherein LMs are prompted in natural language to evaluate qualitative aspects of text. Leveraging this development, we introduce Quality-Diversity through AI Feedback (QDAIF), wherein an evolutionary algorithm applies LMs to both generate variation and evaluate the quality and diversity of candidate text. In all but one creative writing domain, QDAIF covers more of a specified search space with high-quality samples than do non-QD controls. Further, human evaluation of QDAIF-generated creative texts validates reasonable agreement between AI and human evaluation. Our results thus highlight the potential of AI feedback to guide open-ended search for creative and original solutions, providing a recipe that seemingly generalizes to many domains and modalities. In this way, QDAIF is a step towards AI systems that can independently search, diversify, evaluate, and improve, which are among the core skills underlying human society's capacity for innovation. |
Herbie Bradley · Andrew Dai · Hannah Teufel · Jenny Zhang · Koen Oostermeijer · Marco Bellagente · Jeff Clune · Kenneth Stanley · Grégory Schott · Joel Lehman 🔗 |
-
|
Quality Diversity through Human Feedback
(
Spotlight
)
>
link
Reinforcement learning from human feedback (RLHF) has exhibited the potential to enhance the performance of foundation models for qualitative tasks. Despite its promise, its efficacy is often restricted when conceptualized merely as a mechanism to maximize learned reward models of averaged human preferences, especially in areas such as image generation which demand diverse model responses. Meanwhile, quality diversity (QD) algorithms, dedicated to seeking diverse, high-quality solutions, are often constrained by the dependency on manually defined diversity metrics. Interestingly, such limitations of RLHF and QD can be overcome by blending insights from both. This paper introduces Quality Diversity through Human Feedback (QDHF), which employs human feedback for inferring diversity metrics, expanding the applicability of QD algorithms. Empirical results reveal that QDHF outperforms existing QD methods regarding automatic diversity discovery, and matches the search capabilities of QD with human-constructed metrics. Notably, when deployed for a latent space illumination task, QDHF markedly enhances the diversity of images generated by a Diffusion model. The study concludes with an in-depth analysis of QDHF's sample efficiency and the quality of its derived diversity metrics, emphasizing its promise for enhancing exploration and diversity in optimization for complex, open-ended tasks. |
Li Ding · Jenny Zhang · Jeff Clune · Lee Spector · Joel Lehman 🔗 |
-
|
Mix-ME: Quality-Diversity for Multi-Agent Learning
(
Poster
)
>
link
In many real-world systems, such as adaptive robotics, achieving a single, optimised solution may be insufficient. Instead, a diverse set of high-performing solutions is often required to adapt to varying contexts and requirements. This is the realm of Quality-Diversity (QD), which aims to discover a collection of high-performing solutions, each with their own unique characteristics. QD methods have recently seen success in many domains, including robotics, where they have been used to discover damage-adaptive locomotion controllers. However, most existing work has focused on single-agent settings, despite many tasks of interest being multi-agent. To this end, we introduce Mix-ME, a novel multi-agent variant of the popular MAP-Elites algorithm that forms new solutions using a crossover-like operator by mixing together agents from different teams. We evaluate the proposed methods on a variety of partially observable continuous control tasks. Our evaluation shows that these multi-agent variants obtained by Mix-ME not only compete with single-agent baselines but also often outperform them in multi-agent settings under partial observability. |
Garðar Ingvarsson Juto · Mikayel Samvelyan · Manon Flageat · Bryan Lim · Antoine Cully · Tim Rocktäschel 🔗 |
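A minimal sketch of the idea follows: a standard MAP-Elites loop whose variation operator mixes agents from two elite parent teams. The fitness, descriptor, and mutation functions are assumptions for illustration, not the paper's implementation.

```python
import random

def map_elites_mix(init_teams, fitness, descriptor, mutate, n_iters=1000):
    """MAP-Elites with a Mix-ME-style crossover: each agent slot of the
    child team is inherited from one of two elite parent teams."""
    archive = {}  # behaviour cell -> (team, fitness)
    for team in init_teams:
        cell, f = descriptor(team), fitness(team)
        if cell not in archive or f > archive[cell][1]:
            archive[cell] = (team, f)
    for _ in range(n_iters):
        elites = list(archive.values())
        team_a = random.choice(elites)[0]
        team_b = random.choice(elites)[0]
        child = [random.choice(pair) for pair in zip(team_a, team_b)]  # mix
        child = mutate(child)
        cell, f = descriptor(child), fitness(child)
        if cell not in archive or f > archive[cell][1]:
            archive[cell] = (child, f)
    return archive

# Toy usage: a "team" is a list of per-agent parameter floats.
archive = map_elites_mix(
    init_teams=[[random.random() for _ in range(3)] for _ in range(10)],
    fitness=lambda team: -sum((x - 0.7) ** 2 for x in team),
    descriptor=lambda team: tuple(round(x, 1) for x in team[:2]),
    mutate=lambda team: [x + random.gauss(0, 0.05) for x in team],
)
```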
-
|
What can AI Learn from Human Exploration? Intrinsically-Motivated Humans and Agents in Open-World Exploration
(
Poster
)
>
link
What drives exploration? Understanding intrinsic motivation is a long-standing question in both cognitive science and artificial intelligence (AI); numerous exploration objectives have been proposed and tested in human experiments and used to train reinforcement learning (RL) agents. However, experiments in the former are often in simplistic environments that do not capture the complexity of real world exploration. On the other hand, experiments in the latter use more complex environments yet the trained RL agents fail to come close to human exploration. In this work we directly compare human and agent exploration in a shared open-ended environment, Crafter (Hafner 2021). We study how well commonly-proposed information theoretic objectives for intrinsic motivation relate to actual human and agent behaviors, finding that human exploration consistently correlates significantly with entropy, information gain, and empowerment, whereas intrinsically-motivated RL agent exploration does not. We also analyze self-talk during play and find that children's verbalizations exhibit a significant relationship with empowerment and entropy, whereas adult verbalizations do not. |
Yuqing Du · Eliza Kosoy · Alyssa L Dayan · Maria Rufova · Alison Gopnik · Pieter Abbeel 🔗 |
-
|
How the level sampling process impacts zero-shot generalisation in deep reinforcement learning
(
Poster
)
>
link
A key limitation preventing the wider adoption of autonomous agents trained via deep reinforcement learning (RL) is their limited ability to generalise to new environments, even when these share similar characteristics with environments encountered during training. In this work, we investigate how a non-uniform sampling strategy of individual environment instances, or levels, affects the zero-shot generalisation (ZSG) ability of RL agents, considering two failure modes: overfitting and over-generalisation. As a first step, we measure the mutual information (MI) between the agent's internal representation and the set of training levels, which we find to be well-correlated to instance overfitting. In contrast to uniform sampling, adaptive sampling strategies prioritising levels based on their value loss are more effective at maintaining lower MI, which provides a novel theoretical justification for this class of techniques. We then turn our attention to unsupervised environment design (UED) methods, which adaptively generate new training levels and minimise MI more effectively than methods sampling from a fixed set. However, we find UED methods significantly shift the training distribution, resulting in over-generalisation and worse ZSG performance over the distribution of interest. To prevent both instance overfitting and over-generalisation, we introduce self-supervised environment design (SSED). SSED generates levels using a variational autoencoder, effectively reducing MI while minimising the shift with the distribution of interest, and leads to statistically significant improvements in ZSG over fixed-set level sampling strategies and UED methods. |
Samuel Garcin · James Doran · Shangmin Guo · Christopher G Lucas · Stefano Albrecht 🔗 |
-
|
Vision-Language Models as a Source of Rewards
(
Poster
)
>
link
Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents. |
Harris Chan · Volodymyr Mnih · Feryal Behbahani · Michael Laskin · Luyu Wang · Fabio Pardo · Maxime Gazeau · Himanshu Sahni · Daniel Horgan · Kate Baumli · Yannick Schroecker · Stephen Spencer · Richie Steigerwald · John Quan · Gheorghe Comanici · Sebastian Flennerhag · Alexander Neitz · Lei Zhang · Tom Schaul · Satinder Singh · Clare Lyle · Tim Rocktäschel · Jack Parker-Holder · Kristian Holsheimer
|
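The general recipe, deriving a goal-achievement reward from image-text similarity, can be sketched with an off-the-shelf CLIP checkpoint. The model choice and threshold below are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Score how well the current frame matches a language goal with CLIP,
# and emit a thresholded binary reward for the RL agent.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(frame, goal_text: str, threshold: float = 0.3) -> float:
    """Return 1.0 if image-text similarity clears a threshold, else 0.0.
    `frame` is a PIL image; the threshold value is an assumption."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
        # Cosine similarity between the normalized image and text embeddings.
        sim = torch.nn.functional.cosine_similarity(
            out.image_embeds, out.text_embeds
        ).item()
    return 1.0 if sim > threshold else 0.0
```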
-
|
Discovering Temporally-Aware Reinforcement Learning Algorithms
(
Poster
)
>
link
Recent advancements in meta-learning have enabled the automatic discovery of novel reinforcement learning algorithms parameterized by surrogate objective functions. To improve upon manually designed algorithms, the parameterization of this learned objective function must be expressive enough to represent novel principles of learning (instead of merely recovering already established ones) while still generalizing to a wide range of settings outside of its meta-training distribution. However, existing methods focus on discovering objective functions that, like many widely used objective functions in reinforcement learning, do not take into account the total number of steps allowed for training, or “training horizon”. In contrast, humans use a plethora of different learning objectives across the course of acquiring a new ability. For instance, students may alter their studying techniques based on the proximity to exam deadlines and their self-assessed capabilities. This paper contends that ignoring the optimization time horizon significantly restricts the expressive potential of discovered learning algorithms. We propose a simple augmentation to two existing objective discovery approaches that allows the discovered algorithm to dynamically update its objective function throughout the agent’s training procedure, resulting in expressive schedules and increased generalization across different training horizons. In the process, we find that commonly used meta-gradient approaches fail to discover such adaptive objective functions while evolution strategies discover highly dynamic learning rules. We demonstrate the effectiveness of our approach on a wide range of tasks and analyze the resulting learned algorithms, which we find effectively balance exploration and exploitation by modifying the structure of their learning rules throughout the agent’s lifetime. |
Matthew T Jackson · Chris Lu · Louis Kirsch · Robert Lange · Shimon Whiteson · Jakob Foerster 🔗 |
-
|
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
(
Poster
)
>
link
Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially “black boxes” to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs. |
Zhibin Gou · Zhihong Shao · Yeyun Gong · yelong shen · Yujiu Yang · Nan Duan · Weizhu Chen 🔗 |
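The verify-then-correct loop described above can be sketched schematically as follows; the llm() and run_tool() stubs are toy stand-ins, not the paper's actual prompts or tool APIs:

```python
# A schematic, toy sketch of a CRITIC-style verify-then-correct loop.
def llm(prompt: str) -> str:
    # Stand-in for any chat/completion call.
    return "REVISED: " + prompt[-40:]

def run_tool(output: str) -> str:
    # Stand-in for e.g. a search-engine fact check or a code-interpreter run.
    return "ok" if "REVISED" in output else "found a possible factual error"

def critic_loop(question: str, max_rounds: int = 3) -> str:
    output = llm(f"Answer the question: {question}")
    for _ in range(max_rounds):
        feedback = run_tool(output)      # external, tool-based verification
        if feedback == "ok":             # illustrative stopping rule
            break
        output = llm(f"Revise using feedback '{feedback}': {output}")
    return output

print(critic_loop("What year was the first NeurIPS held?"))
```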
-
|
RAVL: Reach-Aware Value Learning for the Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
(
Poster
)
>
link
Training generalist agents requires learning in complex, open-ended environments. In the real world, as well as in standard benchmarks, such environments often come with large quantities of pre-collected behavioral data. Offline reinforcement learning presents an exciting possibility for leveraging this existing data to kickstart subsequent expensive open-ended learning. Using offline data with RL, however, introduces the additional challenge of evaluating values for state-actions not seen in the dataset -- termed the out-of-sample problem. One solution to this is to allow the agent to generate additional synthetic data through rollouts in a learned dynamics model. The prevailing theoretical understanding is that this effectively resolves the out-of-sample issue, and that any remaining difficulties are due to errors in the learned dynamics model. Based on this understanding, one would expect improvements to the dynamics model to lead to improvements to the learned policy. Surprisingly, however, we find that existing algorithms completely fail when the true dynamics are provided in place of the learned dynamics model. This observation exposes a common misconception in offline reinforcement learning: contrary to the prevailing understanding, dynamics model errors do not explain the behavior of model-based methods. Our subsequent investigation reveals a second major and previously overlooked issue in offline model-based reinforcement learning (which we term the edge-of-reach problem). Guided by this new insight, we propose Reach-Aware Value Learning (RAVL), a value-based algorithm that is able to capture value uncertainty at edge-of-reach states and resolve the edge-of-reach problem. Our method achieves strong performance on the standard D4RL benchmark, and we hope that the insights developed in this paper help to advance offline RL in order for it to serve as an easily applicable pre-training technique for open-ended settings. |
Anya Sims · Cong Lu · Yee Whye Teh 🔗 |
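One common way to capture the value uncertainty the abstract refers to is disagreement across a Q-ensemble; the sketch below assumes that mechanism for illustration and is not necessarily RAVL's exact estimator:

```python
# A minimal sketch: penalize value targets by ensemble disagreement, so
# poorly covered ("edge-of-reach") model-rollout states get pessimistic
# values. Shapes and the pessimism coefficient are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_ensemble, n_states = 8, 5
q_ensemble = rng.normal(size=(n_ensemble, n_states))  # Q_i(s', a') estimates

mean_q = q_ensemble.mean(axis=0)
std_q = q_ensemble.std(axis=0)    # large at states the data never reaches
beta = 1.0                        # pessimism coefficient (assumed)
targets = mean_q - beta * std_q   # lower-confidence-bound style value targets
print(targets)
```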
-
|
Improving Intrinsic Exploration by Creating Stationary Objectives
(
Poster
)
>
link
SlidesLive Video
Exploration bonuses in reinforcement learning guide long-horizon exploration by defining custom intrinsic objectives. Count-based methods use the frequency of state visits to derive an exploration bonus. In this paper, we identify that any intrinsic reward function derived from count-based methods is non-stationary and hence induces an objective that is difficult for the agent to optimize. The key contribution of our work lies in transforming the original non-stationary rewards into stationary rewards through an augmented state representation. For this purpose, we introduce the Stationary Objectives For Exploration (SOFE) framework. SOFE requires identifying sufficient statistics for different exploration bonuses and finding an efficient encoding of these statistics to use as input to a deep network. SOFE proposes state augmentations that expand the state space but hold the promise of simplifying the optimization of the agent's objective. Our experiments show that SOFE improves the agents' performance in challenging exploration problems, including sparse-reward tasks, pixel-based observations, 3D navigation, and procedurally generated environments. |
Roger Creus Castanyer · Joshua Romoff · Glen Berseth 🔗 |
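The stationarity trick can be made concrete with a count-based bonus on a toy gridworld: the bonus 1/sqrt(N(s)) is non-stationary as a function of s alone, but becomes a fixed function of an augmented observation that includes the count statistics. The encoding below is an illustrative assumption, not the paper's exact one:

```python
# A minimal sketch of the stationarity idea on a 4x4 gridworld.
import numpy as np

counts = np.zeros((4, 4))            # state visitation counts

def intrinsic_reward(s):
    return 1.0 / np.sqrt(1.0 + counts[s])

def augmented_obs(s):
    one_hot = np.zeros(16)
    one_hot[s[0] * 4 + s[1]] = 1.0
    # Append a sufficient statistic of the bonus to the observation, so the
    # intrinsic objective is a stationary function of what the agent sees.
    return np.concatenate([one_hot, (counts / (1.0 + counts.sum())).ravel()])

s = (2, 3)
counts[s] += 1
print(augmented_obs(s).shape, intrinsic_reward(s))
```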
-
|
Training Reinforcement Learning Agents and Humans with Difficulty-Conditioned Generators
(
Poster
)
>
link
We introduce Parameterized Environment Response Model (PERM), a method for training both Reinforcement Learning (RL) Agents and human learners in parameterized environments by directly modeling difficulty and ability. Inspired by Item Response Theory (IRT), PERM aligns environment difficulty with individual ability, creating a Zone of Proximal Development-based curriculum. Remarkably, PERM operates without real-time RL updates and allows for offline training, ensuring its adaptability across diverse students. We present a two-stage training process that capitalizes on PERM's adaptability, and demonstrate its effectiveness in training RL agents and humans in an empirical study. |
Sidney Tio · Pradeep Varakantham 🔗 |
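A minimal sketch of the Item Response Theory intuition, assuming a Rasch-style (1PL) success curve and a target success rate; the specific selection rule is illustrative, not necessarily PERM's:

```python
# Match environment difficulty to estimated learner ability via IRT.
import numpy as np

def p_success(ability, difficulty):
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))  # Rasch / 1PL curve

difficulties = np.linspace(-3, 3, 13)   # parameterized environment levels
ability = 0.4                           # estimated from past outcomes

# Zone-of-proximal-development-style pick: closest to a target success rate.
target = 0.7
idx = np.argmin(np.abs(p_success(ability, difficulties) - target))
print("next difficulty:", difficulties[idx])
```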
-
|
Eureka: Human-Level Reward Design via Coding Large Language Models
(
Spotlight
)
>
link
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate, for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed. |
Jason Ma · William Liang · Guanzhi Wang · De-An Huang · Osbert Bastani · Dinesh Jayaraman · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
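The outer loop Eureka describes, evolutionary search over LLM-written reward code with reflection on the best candidate, can be sketched schematically; llm_write_reward() and train_and_eval() are toy stand-ins for an LLM call and a full RL training run:

```python
# A schematic, toy sketch of an Eureka-style evolutionary outer loop.
import random

def llm_write_reward(context: str) -> str:
    # Stand-in for an LLM generating reward-function source code.
    return f"def reward(obs, act):\n    return -abs(obs) * {random.random():.3f}"

def train_and_eval(reward_src: str) -> float:
    # Stand-in for an RL training run scored by task fitness.
    return random.random()

best_src, best_fit, context = None, float("-inf"), "task description"
for generation in range(3):
    candidates = [llm_write_reward(context) for _ in range(4)]
    fit, src = max((train_and_eval(c), c) for c in candidates)
    if fit > best_fit:
        best_fit, best_src = fit, src
    # Reflection: feed the best reward so far back into the next generation.
    context = f"previous best (fitness {best_fit:.3f}):\n{best_src}"
print(best_fit)
```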
-
|
Quality Diversity in the Amorphous Fortress: Evolving for Complexity in 0-Player Games
(
Poster
)
>
link
We explore the generation of diverse environments using the Amorphous Fortress (AF) simulation framework. AF defines a set of Finite State Machine (FSM) nodes and edges that can be recombined to control the behavior of agents in the `fortress' grid-world. The behaviors and conditions of the agents within the framework are designed to capture the common building blocks of multi-agent artificial life and reinforcement learning environments. Using quality diversity evolutionary search, we generate diverse sets of environments whose dynamics exhibit certain types of complexity according to measures of agents' FSM architectures and activations, and collective behaviors. QD-AF generates families of 0-player games akin to simplistic ecological models, and we identify the emergence of both competitive and co-operative multi-agent and multi-species survival dynamics. We argue that these generated worlds can collectively serve as training and testing grounds for learning algorithms. |
Sam Earle · M Charity · Julian Togelius · Dipika Rajesh 🔗 |
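A toy example of the kind of FSM-controlled agent such a framework recombines; the states, conditions, and transitions below are invented for illustration rather than taken from AF's actual node and edge set:

```python
# A toy finite state machine controlling an agent: edges map
# (current state, observed condition) -> next state.
FSM = {
    ("wander", "sees_food"): "seek",
    ("seek", "at_food"): "eat",
    ("eat", "food_gone"): "wander",
}

def step(state: str, observation: str) -> str:
    return FSM.get((state, observation), state)  # stay put if no edge fires

state = "wander"
for obs in ["nothing", "sees_food", "at_food", "food_gone"]:
    state = step(state, obs)
    print(obs, "->", state)
```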
-
|
SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
(
Poster
)
>
link
Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents. |
Xuhui Zhou · Hao Zhu · Leena Mathur · Ruohong Zhang · Haofei Yu · Zhengyang Qi · Louis-Philippe Morency · Yonatan Bisk · Daniel Fried · Graham Neubig · Maarten Sap
|
-
|
HomeRobot: Open-Vocabulary Mobile Manipulation
(
Poster
)
>
link
HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM and for eventually building robust open-ended learning systems. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways in which future research can improve performance. |
Sriram Yenamandra · Arun Ramachandran · Karmesh Yadav · Austin Wang · Mukul Khanna · Theophile Gervet · Tsung-Yen Yang · Vidhi Jain · Alexander Clegg · John Turner · Zsolt Kira · Manolis Savva · Angel Chang · Devendra Singh Chaplot · Dhruv Batra · Roozbeh Mottaghi · Yonatan Bisk · Chris Paxton
|
-
|
AgentTorch: Agent-based Modeling with Automatic Differentiation
(
Poster
)
>
link
Agent-based models (ABMs) are discrete simulators comprising agents that can act and interact in a computational world. ABMs are relevant across several disciplines as these agents can be cells in bio-electric networks, humans in physical networks, or even AI avatars in digital networks. Despite wide applicability, research in ABMs has been extremely fragmented and has not benefited from modern computational advances, especially automatic differentiation. This paper presents AgentTorch: a framework to design, simulate, and optimize agent-based models. AgentTorch definitions can be used to build stochastic, non-linear ABMs across digital, biological, and physical realms, while ensuring gradient flow through all simulation steps. AgentTorch simulations are fully tensorized, execute on GPUs, and can range from a few hundred agents in synthetic grids to millions of agents in real-world contact graphs. The end-to-end differentiability of AgentTorch enables differentiation with respect to simulation parameters and integration with deep neural networks (DNNs) in several ways, for both supervised and reinforcement learning. We validate AgentTorch through multiple case studies covering cell morphogenesis over bio-electric networks, infectious disease epidemiology over physical networks, and opinion dynamics over social networks. AgentTorch is designed to be a viable toolkit for scientific exploration and real-world policy decision-making. We hope AgentTorch can help bridge research in AI and agent-based modeling. |
Ayush Chopra · Jayakumar Subramanian · Balaji Krishnamurthy · Ramesh Raskar 🔗 |
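A minimal sketch of what a tensorized, differentiable agent update looks like in this spirit (an invented epidemic toy, not AgentTorch's actual API): every agent's state lives in one tensor and the transition is smooth, so gradients flow back to simulation parameters:

```python
# Toy differentiable agent-based step: gradients reach the infection rate.
import torch

n_agents = 1_000
infected = torch.rand(n_agents) < 0.01           # initial agent states
beta = torch.tensor(0.3, requires_grad=True)     # calibratable parameter

frac = infected.float().mean()
p_infect = 1 - torch.exp(-beta * frac)           # smooth infection pressure
new_infected = infected.float() + (1 - infected.float()) * p_infect

loss = (new_infected.mean() - 0.05) ** 2         # fit to an observed target
loss.backward()                                  # gradient w.r.t. beta
print(beta.grad)
```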
-
|
Diversity from Human Feedback
(
Poster
)
>
link
SlidesLive Video
Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. Though diversity-driven methods have seen many successful applications in machine learning, most of them require defining a proper behavior space, which is challenging for humans in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and introduce a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor function consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that DivHF learns a behavior space that aligns better with human requirements than direct data-driven approaches and leads to more diverse solutions under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments. |
Ren-Jian Wang · Ke Xue · Yutong Wang · Peng Yang · Haobo Fu · Qiang Fu · Chao Qian 🔗 |
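One plausible way to learn a behavior descriptor from pairwise human feedback is a Bradley-Terry-style triplet loss; the query format and network below are assumptions for illustration, not the paper's exact setup:

```python
# Learn a 2-D behavior descriptor from triplet human judgments.
import torch

descriptor = torch.nn.Sequential(
    torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
opt = torch.optim.Adam(descriptor.parameters(), lr=1e-3)

def preference_loss(a, b, c, human_says_b_closer: bool):
    # Human judged whether a's behavior is closer to b's than to c's.
    da, db, dc = descriptor(a), descriptor(b), descriptor(c)
    margin = (da - dc).norm() - (da - db).norm()  # > 0 iff b is closer
    sign = 1.0 if human_says_b_closer else -1.0
    return torch.nn.functional.softplus(-sign * margin)

a, b, c = (torch.randn(8) for _ in range(3))
loss = preference_loss(a, b, c, human_says_b_closer=True)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```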
-
|
Unlocking the Power of Representations in Long-term Novelty-based Exploration
(
Poster
)
>
link
We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of episodes. We further propose a novel generalization of the inverse dynamics loss that leverages masked transformer architectures for multi-step prediction, which, in conjunction with RECODE, achieves a new state-of-the-art in a suite of challenging 3D-exploration tasks in DM-HARD-8. RECODE also sets a new state-of-the-art in hard-exploration Atari games, and is the first agent to reach the end screen in Pitfall! |
Steven Kapturowski · Alaa Saade · Daniele Calandriello · Charles Blundell · Pablo Sprechmann · Leopoldo Sarra · Oliver Groth · Michal Valko · Bilal Piot 🔗 |
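The clustering-based count estimate can be sketched as follows: embeddings are assigned to their nearest cluster, cluster counts stand in for state visitation, and the bonus decays with the count. RECODE's actual online clustering is more involved; this is illustrative only:

```python
# Cluster-based visitation counts for a novelty bonus.
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.normal(size=(32, 16))   # cluster centers in embedding space
counts = np.zeros(32)

def novelty_bonus(embedding, lr=0.05):
    i = np.argmin(np.linalg.norm(centroids - embedding, axis=1))
    counts[i] += 1
    centroids[i] += lr * (embedding - centroids[i])  # track non-stationarity
    return 1.0 / np.sqrt(counts[i])

print(novelty_bonus(rng.normal(size=16)))
```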
-
|
Diverse Offline Imitation Learning
(
Poster
)
>
link
There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system. |
Marin Vlastelica Pogančić · Jin Cheng · Georg Martius · Pavel Kolev 🔗 |
-
|
From Centralized to Self-Supervised: Pursuing Realistic Multi-Agent Reinforcement Learning
(
Poster
)
>
link
In real-world environments, autonomous agents rely on their egocentric observations. They must learn adaptive strategies to interact with others who possess mixed motivations, discernible only through visible cues. Several Multi-Agent Reinforcement Learning (MARL) methods adopt centralized approaches that involve either centralized training or reward-sharing, often violating the realistic ways in which living organisms, like animals or humans, process information and interact. MARL strategies deploying decentralized training with intrinsic motivation offer a self-supervised approach, enabling agents to develop flexible social strategies through the interaction of autonomous agents. However, by contrasting the self-supervised and centralized methods, we reveal that populations trained with reward-sharing methods surpass those using self-supervised methods in a mixed-motive environment. We link this superiority to specialized role emergence and an agent's expertise in its role. Interestingly, this gap shrinks in pure-motive settings, emphasizing the need for evaluations in more complex, realistic environments (mixed-motive). Our preliminary results suggest a gap in population performance that can be closed by improving self-supervised methods and thereby pushing MARL closer to real-world readiness. |
Violet Xiang · Logan Cross · Jan-Philipp Fraenken · Nick Haber 🔗 |
-
|
JaxMARL: Multi-Agent RL Environments in JAX
(
Poster
)
>
link
Benchmarks play an important role in the development of machine learning algorithms. Reinforcement learning environments are traditionally run on the CPU, limiting their scalability with typical academic compute. However, recent advancements in JAX have enabled the wider use of hardware acceleration to overcome these computational hurdles by producing massively parallel RL training pipelines and environments. This is particularly useful for multi-agent reinforcement learning (MARL) research, where not only must multiple agents be considered at each environment step, adding computational burden, but the sample complexity is also increased due to non-stationarity, decentralised partial observability, or other MARL challenges. In this paper, we present JaxMARL, the first open-source code base that combines ease-of-use with GPU-enabled efficiency, and supports a large number of commonly used MARL environments as well as popular baseline algorithms. Our experiments show that our JAX-based implementations are up to 1400x faster than existing single-threaded baselines. This enables efficient and thorough evaluations, with the potential to alleviate the evaluation crisis of the field. We also introduce and benchmark SMAX, a vectorised, simplified version of the StarCraft Multi-Agent Challenge, which removes the need to run the StarCraft II game engine. This not only enables GPU acceleration, but also provides a more flexible MARL environment, unlocking the potential for self-play, meta-learning, and other future applications in MARL. |
Alexander Rutherford · Benjamin Ellis · Matteo Gallici · Jonathan Cook · Andrei Lupu · Garðar Ingvarsson Juto · Timon Willi · Akbir Khan · Christian Schroeder de Witt · Alexandra Souly · Saptarashmi Bandyopadhyay · Mikayel Samvelyan · Minqi Jiang · Robert Lange · Shimon Whiteson · Bruno Lacerda · Nick Hawes · Tim Rocktäschel · Chris Lu · Jakob Foerster
|
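The reason JAX helps here is that a pure-function environment step can be vmapped across thousands of parallel environments on a GPU; the toy two-agent environment below is invented, and JaxMARL's actual API differs:

```python
# Vectorizing a pure-function env step across many parallel environments.
import jax
import jax.numpy as jnp

def env_step(state, action):
    # Toy 2-agent env: both agents' actions jointly push the state around.
    next_state = state + 0.1 * action.sum()
    reward = -jnp.abs(next_state).sum()
    return next_state, reward

batched_step = jax.vmap(env_step)        # one call, many envs
states = jnp.zeros((4096, 2))
actions = jnp.ones((4096, 2))            # actions for 2 agents per env
states, rewards = batched_step(states, actions)
print(rewards.shape)                     # (4096,)
```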
-
|
CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
(
Poster
)
>
link
SlidesLive Video
Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time, beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory, centered on causal abstractions (rather than general “helpful hints”), that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time. |
Bodhisattwa Prasad Majumder · Bhavana Dalvi Mishra · Peter A Jansen · Oyvind Tafjord · Niket Tandon · Li Zhang · Chris Callison-Burch · Peter Clark 🔗 |
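CLIN-style continual learning without parameter updates can be sketched as a loop over trials in which a frozen model distills causal memos into a persistent textual memory; the llm() stub and memo format below are illustrative:

```python
# Schematic sketch: a frozen model improves across trials via textual memory.
def llm(prompt: str) -> str:
    # Stand-in for a frozen LLM acting or reflecting.
    return "opening the fridge is necessary for getting milk"

memory: list[str] = []
for trial in range(3):
    prompt = "Task: grow a plant.\nKnown: " + "; ".join(memory)
    trajectory = llm(prompt)                  # act in the environment
    memo = llm(f"From {trajectory!r}, state one causal abstraction.")
    if memo not in memory:                    # persistent, dynamic memory
        memory.append(memo)
print(memory)
```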
-
|
Skill-Conditioned Policy Optimization with Successor Features Representations
(
Poster
)
>
link
A key aspect of intelligence is the ability to exhibit a wide range of behaviors to adapt to unforeseen situations. Designing artificial agents that are capable of showcasing a broad spectrum of skills is a long-standing challenge in Artificial Intelligence. In the last decade, progress in deep Reinforcement Learning (RL) has made it possible to solve complex tasks with high-dimensional, continuous state and action spaces. However, most approaches return only one highly-specialized solution to a single problem. We introduce a Skill-Conditioned OPtimal Agent (SCOPA) that leverages successor features representations to learn skills that solve a task. We derive a policy skill improvement update with successor features, analogous to the classic policy improvement update, which we use to learn skills. From this result, we develop an algorithm that combines successor features with universal function approximators to learn a skill representation that extends the traditional concept of a goal to trajectory-based skills. We seamlessly unify value function and successor features policy iteration with constrained optimization to (1) maximize performance while (2) executing a skill. Compared with other skill-conditioned RL methods, SCOPA reaches significantly higher performance and skill space coverage on challenging continuous control locomotion tasks with various types of skills. We also demonstrate that the diversity of skills is useful in downstream adaptation tasks. Videos of our results are available at: http://bit.ly/scopa. |
Luca Grillotti · Maxence Faldor · Borja G. León · Antoine Cully 🔗 |
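For context, the successor-features identity this family of methods builds on: if rewards are linear in features, r = phi(s,a)·z, then Q(s,a,z) = psi(s,a)·z, where psi is the discounted sum of features along the policy's trajectories. A minimal numerical sketch (shapes illustrative):

```python
# Successor features: a skill-conditioned value from one feature rollout.
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 4, 0.99
phi_traj = rng.normal(size=(10, d))  # features phi(s_t, a_t) along a rollout

psi = sum(gamma ** t * phi_traj[t] for t in range(10))  # discounted sum
z = rng.normal(size=d)               # skill / task weight vector
print(psi @ z)                       # skill-conditioned value estimate
```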
-
|
AssemblyCA: A Benchmark of Open-Endedness for Discrete Cellular Automata
(
Poster
)
>
link
We introduce AssemblyCA, a framework for utilizing cellular automata (CA) to benchmark the potential of open-ended processes. The benchmark quantifies the open-endedness of a system composed of resources, agents interacting with CAs, and a set of generated artifacts. We quantify the amount of open-endedness by taking the generated artifacts or objects and analyzing them using the tools of assembly theory (AT). Assembly theory can be used to identify selection in systems that produce objects that can be decomposed into atomic units, where these objects can exist in high copy numbers. By combining an assembly space measure with the copy number of an object, we can quantify the complexity of objects that have a historical contingency. Moreover, this framework allows us to accurately quantify the indefinite generation of novel, diverse, and complex objects, the signature of open-endedness. We benchmark different measures from the assembly space against standard diversity and complexity measures that lack historical contingency. Finally, the open-endedness of three different systems is quantified by performing an undirected exploration in two-dimensional life-like CA, a cultural exploration provided by human experimenters, and an algorithmic exploration by a set of programmed agents. |
Keith Patarroyo · Abhishek Sharma · Sara Walker · Lee Cronin 🔗 |
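For reference, the assembly measure as commonly stated in the assembly-theory literature is A = sum_i exp(a_i) * (n_i - 1) / N_T, where a_i is an object's assembly index, n_i its copy number, and N_T the total number of objects; whether AssemblyCA uses exactly this variant is not stated in the abstract. A toy computation, with assembly indices given rather than computed (computing them is the hard part):

```python
# Toy evaluation of the copy-number-weighted assembly measure.
import math

objects = [          # (assembly_index, copy_number) - invented values
    (2, 50),         # simple object in many copies
    (7, 12),         # complex object in high copy number: dominates A
    (9, 1),          # complex one-off: contributes nothing (n - 1 = 0)
]
N_T = sum(n for _, n in objects)
A = sum(math.exp(a) * (n - 1) / N_T for a, n in objects)
print(f"Assembly A = {A:.2f}")
```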
-
|
PufferLib: Making Reinforcement Learning Libraries and Environments Play Nice
(
Poster
)
>
link
Common simplifying assumptions often cause standard reinforcement learning (RL) methods to fail on complex, open-ended environments. Creating a new wrapper for each environment and learning library can help alleviate these limitations, but building such wrappers is labor-intensive and error-prone. This practical tooling gap restricts the applicability of RL as a whole. To address this challenge, PufferLib transforms complex environments into a broadly compatible, vectorized format that eliminates the need for bespoke conversion layers and enables rigorous cross-environment testing. PufferLib does this without deviating from standard reinforcement learning APIs, significantly reducing the technical overhead. We release PufferLib's complete source code under the MIT license, a pip module, a containerized setup, comprehensive documentation, and example integrations. We also maintain a community Discord channel to facilitate support and discussion. |
Joseph Suarez 🔗 |
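A generic illustration of the kind of normalization such a compatibility layer performs, flattening an arbitrary nested observation into one fixed vector that any learning library can consume; this is not PufferLib's actual wrapper code:

```python
# Flatten nested dict/tuple/array observations into a single 1-D array.
import numpy as np

def flatten_obs(obs):
    if isinstance(obs, dict):
        return np.concatenate([flatten_obs(obs[k]) for k in sorted(obs)])
    if isinstance(obs, (list, tuple)):
        return np.concatenate([flatten_obs(o) for o in obs])
    return np.asarray(obs, dtype=np.float32).ravel()

obs = {"inventory": [1, 0, 3], "position": np.array([[2.0, 5.0]])}
print(flatten_obs(obs))  # -> [1. 0. 3. 2. 5.]
```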
-
|
t-DGR: A Trajectory-Based Deep Generative Replay Method for Continual Learning in Decision Making
(
Poster
)
>
link
Deep generative replay has emerged as a promising approach for continual learning in decision-making tasks. This approach addresses the problem of catastrophic forgetting by leveraging the generation of trajectories from previously encountered tasks to augment the current dataset. However, existing deep generative replay methods for continual learning rely on autoregressive models, which suffer from compounding errors in the generated trajectories. In this paper, we propose a simple, scalable, and non-autoregressive method for continual learning in decision-making tasks using a diffusion model that generates task samples conditioned on the trajectory timestep. We evaluate our method on Continual World benchmarks and find that our approach achieves state-of-the-art performance on the average success rate metric compared to other continual learning methods. |
William Yue · Bo Liu · Peter Stone 🔗 |
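The non-autoregressive replay idea can be sketched as querying a generator directly for "a state at trajectory step t", instead of rolling trajectories out step by step and compounding errors; the generator stub below stands in for the trained diffusion model:

```python
# Schematic timestep-conditioned generative replay.
import numpy as np

rng = np.random.default_rng(0)

def generator(task_id: int, traj_step: int) -> np.ndarray:
    # Stand-in for a diffusion model sampling p(s | task, t).
    return rng.normal(loc=task_id + 0.01 * traj_step, size=4)

def replay_batch(seen_tasks, horizon=50, per_task=8):
    batch = [generator(k, int(t))
             for k in seen_tasks
             for t in rng.integers(0, horizon, size=per_task)]
    return np.stack(batch)

print(replay_batch(seen_tasks=[0, 1, 2]).shape)  # (24, 4)
```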
-
|
Procedural generation of meta-reinforcement learning tasks
(
Poster
)
>
link
Open-endedness stands to benefit from the ability to generate an infinite variety of diverse, challenging environments. One particularly interesting type of challenge is meta-learning (“learning-to-learn”), a hallmark of intelligent behavior. However, the number of meta-learning environments in the literature is limited. Here we describe a parametrized space for simple meta-reinforcement learning (meta-RL) tasks with arbitrary stimuli. The parametrization allows us to randomly generate an arbitrary number of novel simple meta-learning tasks. The parametrization is expressive enough to include many well-known meta-RL tasks, such as bandit tasks, the Harlow task, T-mazes, the Daw two-step task and others. Simple extensions allow it to capture tasks based on two-dimensional topological spaces, such as find-the-spot or key-door tasks. We describe a number of randomly generated meta-RL tasks and discuss potential issues arising from random generation. |
Thomas Miconi 🔗 |
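The generation pattern can be illustrated with the simplest case in the paper's space, a randomly sampled bandit task; the particular parameters below are invented:

```python
# Sample a novel bandit task from a parameterized task space.
import numpy as np

rng = np.random.default_rng(0)

def sample_bandit_task():
    n_arms = int(rng.integers(2, 6))
    return {"payoffs": rng.uniform(size=n_arms),
            "episode_length": int(rng.integers(10, 100))}

task = sample_bandit_task()
arm = int(np.argmax(task["payoffs"]))    # oracle choice, for illustration
reward = float(rng.random() < task["payoffs"][arm])
print(task, reward)
```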
-
|
Curriculum Learning from Smart Retail Investors: Towards Financial Open-endedness
(
Poster
)
>
link
The integration of data-driven supervised learning and reinforcement learning has demonstrated promising potential for stock trading. It has been observed that introducing training examples to a learning algorithm in a meaningful order or sequence, known as curriculum learning, can speed up convergence and yield improved solutions. In this paper, we present a financial curriculum learning method that achieves superhuman performance in automated stock trading. First, using high-quality financial datasets from smart retail investors, such as trading logs, we train our algorithm through imitation learning to obtain a reasonably competent solution. Subsequently, in a second stage, we leverage reinforcement learning techniques to develop a novel curriculum learning strategy that helps traders beat the stock market. |
Kent Wu · Ziyi Xia · Shuaiyu Chen · Xiao-Yang Liu 🔗 |
-
|
Melanie Mitchell
(
Invited
)
>
|
🔗 |