Workshop
Foundation Models for Decision Making
Sherry Yang · Ofir Nachum · Yilun Du · Stephen McAleer · Igor Mordatch · Linxi Fan · Jeannette Bohg · Dale Schuurmans
Hall E2 (level 1)
Foundation models pretrained on diverse vision and language datasets have demonstrated exceptional capabilities across a wide range of downstream vision and language tasks. As foundation models are deployed in real-world applications such as dialogue, autonomous driving, healthcare, and robotics, they inevitably face new challenges: learning from external feedback, adapting to different task modalities, and performing long-term reasoning and planning.

Such challenges have traditionally been at the core of sequential decision making, which encompasses reinforcement learning, imitation learning, planning, search, and optimal control. These fields have generally focused on task-specific settings with limited prior knowledge, yet they have made significant progress in surpassing human performance in tasks such as playing board games and Atari video games, as well as operating robots to complete navigation and manipulation tasks. However, because these methods typically learn to solve a specific task from scratch, without broad knowledge from vision and language, they can struggle with generalization and sample efficiency.

The goal of this workshop is to bring the sequential decision making community, including planning, search, RL, and optimal control, together with the foundation models community in vision and language to confront the challenges of decision making at scale. The workshop will span high-level discussions of how foundation models and decision making can benefit each other when jointly considered, as well as low-level algorithmic details of various decision making algorithms and vision-language architectures, which present both opportunities and challenges. Specific topics include foundation model agents interacting with humans, computers, tools, simulators, the physical world, and each other.
Schedule
Fri 6:15 a.m. - 7:20 a.m. | Poster Session I (Poster Session)
Fri 7:20 a.m. - 7:30 a.m. | Opening Remarks
Fri 7:30 a.m. - 8:00 a.m. | Percy Liang (Invited Talk)
Fri 8:00 a.m. - 8:30 a.m. | Ruslan Salakhutdinov (Invited Talk)
Fri 8:30 a.m. - 9:00 a.m. | Jürgen Schmidhuber (Invited Talk)
Fri 9:00 a.m. - 9:10 a.m. | Universal Visual Decomposer: Long-Horizon Manipulation Made Easy (Oral Presentation)
Fri 9:10 a.m. - 9:20 a.m. | Goal Masked Diffusion Policies for Unified Navigation and Exploration (Oral Presentation)
Fri 9:20 a.m. - 9:30 a.m. | Motif: Intrinsic Motivation from Artificial Intelligence Feedback (Oral Presentation)
Fri 9:30 a.m. - 9:40 a.m. | WebArena: A Realistic Web Environment for Building Autonomous Agents (Oral Presentation)
Fri 9:40 a.m. - 9:50 a.m. | Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning (Oral Presentation)
Fri 9:50 a.m. - 10:00 a.m. | Benchmarking Large Language Models as AI Research Agents (Oral Presentation)
Fri 10:00 a.m. - 11:00 a.m. | Lunch & Poster Session II (Poster Session)
Fri 11:00 a.m. - 11:30 a.m. | Phillip Isola (Invited Talk)
Fri 11:30 a.m. - 12:00 p.m. | Xinyun Chen (Invited Talk)
Fri 12:00 p.m. - 12:30 p.m. | Chelsea Finn (Invited Talk)
Fri 12:30 p.m. - 1:00 p.m. | Russ Tedrake (Invited Talk)
Fri 1:00 p.m. - 1:45 p.m. | Panel Discussion (Panel)
Fri 1:30 p.m. - 2:00 p.m. | Kristen Grauman (Invited Talk)
Fri 2:00 p.m. - 3:30 p.m. | Poster Session III (Poster Session)
Fri 2:00 p.m. - 2:15 p.m. | Industry Demos
WebArena: A Realistic Web Environment for Building Autonomous Agents (Poster & Oral)
With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, show that current state-of-the-art large language models are still far from perfect performance on these real-life tasks, and establish WebArena as a platform for measuring such progress.
Shuyan Zhou · Frank F. Xu · Hao Zhu · Xuhui Zhou · Robert Lo · Abishek Sridhar · Xianyi Cheng · Tianyue Ou · Yonatan Bisk · Daniel Fried · Uri Alon · Graham Neubig
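The benchmark's evaluation centers on functional correctness of the final environment state rather than matching a reference action sequence. A minimal sketch of that evaluation pattern, with every name hypothetical (this is not the WebArena API):

```python
# Sketch: functional-correctness evaluation. A task is judged by a
# predicate over the final environment state, not by comparing the
# agent's actions to a reference trajectory. All names are illustrative.

def make_task(instruction, verify):
    """Bundle a natural-language instruction with a state predicate."""
    return {"instruction": instruction, "verify": verify}

def evaluate(agent, env_state, task):
    """Run the agent on the task, then check the resulting state."""
    final_state = agent(env_state, task["instruction"])
    return task["verify"](final_state)

# Toy "web" state: a shopping cart on an e-commerce site.
task = make_task(
    "Add a red mug to the cart",
    verify=lambda s: "red mug" in s["cart"],
)

def scripted_agent(state, instruction):
    # Stand-in for an LLM agent interacting with the site.
    return dict(state, cart=state["cart"] + ["red mug"])

success = evaluate(scripted_agent, {"cart": []}, task)
```

Any action sequence that leaves the site in a correct state passes, which is what makes the tasks "diverse and long-horizon" yet still automatically checkable.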
Towards General-Purpose In-Context Learning Agents (Poster & Oral)
Reinforcement Learning (RL) algorithms are usually hand-crafted, driven by the research and engineering of humans. An alternative approach is to automate this research process via meta-learning. A particularly ambitious objective is to automatically discover new RL algorithms from scratch that use in-context learning to learn-how-to-learn entirely from data while also generalizing to a wide range of environments. Those RL algorithms are implemented entirely in neural networks, by conditioning on previous experience from the environment, without any explicit optimization-based routine at meta-test time. To achieve generalization, this requires a broad task distribution of diverse and challenging environments. Our Transformer-based Generally Learning Agents (GLAs) are an important first step in this direction. Our GLAs are meta-trained using supervised learning techniques on an offline dataset with experiences from RL environments that is augmented with random projections to generate task diversity. During meta-testing our agents perform in-context meta-RL on entirely different robotic control problems such as Reacher, Cartpole, or HalfCheetah that were not in the meta-training distribution.
Louis Kirsch · James Harrison · Daniel Freeman · Jascha Sohl-Dickstein · Jürgen Schmidhuber
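A key ingredient above is augmenting offline experience with random projections of the observations to manufacture task diversity: each projection re-expresses the same underlying dynamics in a new basis, yielding a new synthetic "environment". A hedged pure-Python sketch of that augmentation (not the paper's code):

```python
import random

def random_projection(dim_in, dim_out, seed):
    """Draw a fixed Gaussian projection matrix for one synthetic task."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / dim_out ** 0.5) for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, obs):
    """Apply the projection to a single observation vector."""
    return [sum(w * x for w, x in zip(row, obs)) for row in matrix]

def augment_episode(episode, seed, dim_out):
    """Re-express every observation in an episode under one random basis;
    actions and rewards are untouched, so the task structure survives."""
    dim_in = len(episode[0]["obs"])
    P = random_projection(dim_in, dim_out, seed)
    return [dict(step, obs=project(P, step["obs"])) for step in episode]

episode = [{"obs": [1.0, 0.0], "act": 0, "rew": 1.0},
           {"obs": [0.0, 1.0], "act": 1, "rew": 0.0}]
variants = [augment_episode(episode, seed, dim_out=3) for seed in range(4)]
```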
Agnostic Architecture for Heterogeneous Multi-Environment Reinforcement Learning (Poster & Oral)
Training an agent from scratch for every new environment is inefficient in reinforcement learning. If the agent can instead learn across diverse environments in any form, acquire a wealth of prior knowledge, and effectively perform transfer learning, it can save significant computational and temporal costs. However, performing pretraining and transfer learning across multiple environments is challenging due to differences in states and actions among RL problems. One can employ parameter sharing with an environment-specific architecture, with layers handling each different state-action space, and add new layers for each new environment to enable transfer learning. While this approach performs well in pretraining, we found it lacks efficacy in transfer learning. To address these issues, we introduce a flexible and agnostic architecture capable of learning across multiple environments simultaneously and enabling transfer learning as new environments arrive. For this architecture, we propose algorithms that make multi-environment training more efficient in both online and offline RL settings.
Kukjin Kim · Changhee Joo
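One common way to realize the parameter-sharing idea the abstract describes is per-environment input/output adapters around a shared trunk, with new adapters registered as environments arrive. A hedged sketch with plain functions standing in for networks (illustrative wiring only, not the paper's architecture):

```python
# Sketch: a shared trunk with per-environment adapters. Real versions
# would use neural networks; here the trunk and heads are plain
# functions so the wiring is visible.

class MultiEnvPolicy:
    def __init__(self, trunk_dim):
        self.trunk_dim = trunk_dim
        self.encoders = {}   # env name -> state -> shared representation
        self.decoders = {}   # env name -> shared representation -> action logits

    def add_env(self, name, encoder, decoder):
        """Register adapters for a new environment; the trunk is reused."""
        self.encoders[name] = encoder
        self.decoders[name] = decoder

    def trunk(self, z):
        # Shared computation, a stand-in for the shared network body.
        return [2.0 * v for v in z]

    def act(self, name, state):
        z = self.encoders[name](state)
        assert len(z) == self.trunk_dim, "adapters must map into the trunk space"
        return self.decoders[name](self.trunk(z))

policy = MultiEnvPolicy(trunk_dim=2)
# A 3-dim-state, 2-action environment:
policy.add_env("env_a",
               encoder=lambda s: [s[0] + s[1], s[2]],
               decoder=lambda h: h[:2])
# A 1-dim-state, 3-action environment added later; the trunk is unchanged:
policy.add_env("env_b",
               encoder=lambda s: [s[0], -s[0]],
               decoder=lambda h: [h[0], h[1], h[0] - h[1]])

logits_a = policy.act("env_a", [1.0, 2.0, 3.0])
logits_b = policy.act("env_b", [0.5])
```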
Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment (Poster & Oral)
Large Language Models (LLMs) can acquire extensive world knowledge through pre-training on large corpora. However, due to exposure to low-quality data, LLMs may exhibit harmful behavior without aligning with human values. The dominant approach for steering LLMs towards beneficial behavior involves Reinforcement Learning with Human Feedback (RLHF), with Proximal Policy Optimization (PPO) serving as the default RL optimizer. Despite its effectiveness, PPO has limitations when optimizing rewards trained from comparison-based loss. Primarily, PPO is not invariant to equivalent reward functions containing identical preference information due to the need to calibrate the reward scale. Additionally, PPO's necessity for token-wise updates introduces complexity in both function approximation and algorithm design compared to trajectory-wise optimization. This paper proposes a new framework, reinforcement learning with relative feedback, and a novel trajectory-wise policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O) that operates directly on comparative rewards. We show theoretically that P3O is invariant to equivalent rewards and avoids the complexity of PPO. Empirical evaluations demonstrate that P3O outperforms PPO in the KL-Reward trade-off and can align with human preferences as well as or better than prior methods. In summary, this work introduces a simpler yet effective approach for aligning LLMs to human preferences through relative feedback.
Tianhao Wu · Banghua Zhu · Ruoyu Zhang · Zhaojin Wen · Kannan Ramchandran · Jiantao Jiao
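The invariance claim is easy to see in a toy version of a pairwise, trajectory-wise surrogate: only the reward *difference* between the two responses enters the objective, so adding any constant to both rewards changes nothing. A hedged illustration of that property (a simplified objective, not the paper's exact loss):

```python
def pairwise_surrogate(logp_a, logp_b, reward_a, reward_b):
    """Toy trajectory-wise pairwise objective: weight the log-probability
    gap by the reward gap. Gradients w.r.t. the log-probs push
    probability mass toward the preferred response."""
    return -(reward_a - reward_b) * (logp_a - logp_b)

loss = pairwise_surrogate(-5.0, -7.0, reward_a=2.0, reward_b=1.0)
# Shifting both rewards by the same constant leaves the loss unchanged,
# unlike objectives that consume absolute reward values (as in vanilla
# PPO on a learned reward, where the scale must be calibrated).
shifted = pairwise_surrogate(-5.0, -7.0, reward_a=102.0, reward_b=101.0)
```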
Target Rate Optimization: Avoiding Iterative Error Exploitation (Poster & Oral)
Large models solve hard problems in supervised, unsupervised, and bandit learning. Unfortunately, even with large models, many real-world reinforcement learning (RL) problems remain intractable. A key issue is that sample-efficient RL algorithms are unstable. Early stopping sometimes works around this. But, partly because of the instability itself, early stopping is difficult. Further, the standard approach to early stopping in RL is to stop all learning. Why don't we instead fix the early stopping that most target networks implicitly use already? That is, in algorithms like DQN, the target update rate already early-stops DQN's target-fitting subproblems. Currently, practitioners must either hope the default target rate performs well, or tune it with an expensive grid search over online returns. Moreover, within a run, algorithms like DQN continue to update the target even when the updates increase the training error. This degrades value estimates, which degrades returns. Newer off-policy and offline RL algorithms lessen this well-known deadly triad divergence, but still often fail to capture peak returns. To combat these issues, we propose adding optimization of the training error w.r.t. the target update rate. Our algorithm, Target Rate Optimization, empirically prevents divergence and increases return on both discrete- and continuous-action RL problems.
Braham Snyder · Amy Zhang · Yuke Zhu
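The core move is to treat the target update rate itself as a quantity to adjust against the training (target-fitting) error, rather than fixing it by grid search. A hedged scalar caricature of that feedback loop, with a quadratic standing in for the Q-fitting subproblem (my reading of the idea, not the paper's algorithm):

```python
# Sketch: adapt a soft target-update rate tau based on whether the last
# target update helped or hurt the TD training error. A real agent
# would measure the error on batches of transitions.

def td_error(q_online, q_target):
    return (q_online - q_target) ** 2

q_online, q_target = 0.0, 10.0
tau = 0.5
prev_error = td_error(q_online, q_target)

for _ in range(50):
    # Fit the online estimate toward the current target (inner problem).
    q_online += 0.5 * (q_target - q_online)
    # Soft target update at the current rate.
    q_target = (1 - tau) * q_target + tau * q_online
    # "Target Rate Optimization" step: grow tau (capped at 1) when the
    # update reduced the training error, shrink it when the update
    # increased the error, i.e. when the target was being degraded.
    error = td_error(q_online, q_target)
    tau = min(1.0, tau * (1.1 if error <= prev_error else 0.5))
    prev_error = error
```

In this well-behaved toy the error keeps falling, so tau grows to its cap; the shrink branch is what would engage in an unstable run.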
Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning (Poster & Oral)
Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces Language Models for Motion Control (LaMo), a general framework based on Decision Transformers that effectively uses pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to combine the pre-trained knowledge from LMs and in-domain knowledge effectively, (3) using non-linear MLP transformations instead of linear projections to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original abilities on language. Empirical results indicate LaMo achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and decision transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples.
Ruizhe Shi · Yuyao Liu · Yanjie Ze · Simon Du · Huazhe Xu
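Component (2), LoRA, fine-tunes a small low-rank correction B·A on top of frozen pre-trained weights W, so the effective weight becomes W + (alpha/r)·B·A while only B and A are trained. A minimal pure-Python illustration of that arithmetic (tiny illustrative shapes, not the paper's code):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, B, A, rank, alpha):
    """Frozen base weight W plus the scaled low-rank update (alpha/r)*B*A."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

W = [[1.0, 0.0],
     [0.0, 1.0]]           # frozen 2x2 pre-trained weight
B = [[1.0], [0.0]]         # trainable rank-1 factors: B is 2x1, A is 1x2
A = [[0.0, 2.0]]
W_eff = lora_effective_weight(W, B, A, rank=1, alpha=2.0)
```

Because only B and A receive gradients, the pre-trained knowledge in W is preserved, which is the motivation the abstract gives for preferring LoRA over full-weight fine-tuning.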
H-GAP: Humanoid Control with a Generalist Planner (Poster & Oral)
Humanoid control is an important research challenge offering avenues for integration into human-centric infrastructures and enabling physics-driven humanoid animations. The daunting challenges in this field stem from the difficulty of optimizing in high-dimensional action spaces and the instability introduced by the bipedal morphology of humanoids. However, the extensive collection of human motion-captured data and the derived datasets of humanoid trajectories, such as MoCapAct, pave the way to tackle these challenges. In this context, we present Humanoid Generalist Autoencoding Planner (H-GAP), a state-action trajectory generative model trained on humanoid trajectories derived from human motion-captured data, capable of adeptly handling downstream control tasks with Model Predictive Control (MPC). For a humanoid with 56 degrees of freedom, we empirically demonstrate that H-GAP learns to represent and generate a wide range of motor behaviours. Further, without any learning from online interactions, it can also flexibly transfer these behaviours to solve novel downstream control tasks via planning. Notably, H-GAP surpasses established MPC baselines that have access to the ground truth model, and is superior or comparable to offline RL methods trained for individual tasks. Finally, we conduct a series of empirical studies on the scaling properties of H-GAP, showing the potential for performance gains via additional data but not additional compute.
Zhengyao Jiang · Yingchen Xu · Nolan Wagener · Yicheng Luo · Michael Janner · Edward Grefenstette · Tim Rocktäschel · Yuandong Tian
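The planning step described above follows the standard sampling-based MPC recipe: draw candidate futures from the learned generative trajectory prior, score them with the downstream task reward, execute the first action of the best candidate, and repeat. A hedged skeleton with stub models (the receding-horizon pattern only, not H-GAP itself):

```python
import random

def mpc_step(state, sample_trajectory, task_reward, n_candidates=64, rng=None):
    """One receding-horizon step: sample candidate futures from a
    generative trajectory prior, keep the best under the task reward,
    and return only its first action."""
    rng = rng or random.Random(0)
    best_traj, best_score = None, float("-inf")
    for _ in range(n_candidates):
        traj = sample_trajectory(state, rng)   # list of (state, action)
        score = task_reward(traj)
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj[0][1], best_score

# Stub prior over 1-D trajectories, and a "move right" task reward that
# scores a trajectory by its final position.
def stub_prior(state, rng, horizon=5):
    traj, s = [], state
    for _ in range(horizon):
        a = rng.uniform(-1.0, 1.0)
        s = s + a
        traj.append((s, a))
    return traj

action, score = mpc_step(0.0, stub_prior, task_reward=lambda t: t[-1][0])
```

Swapping `task_reward` is all it takes to repurpose the same prior for a new downstream task, which is the "generalist planner" point of the abstract.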
Reasoning about Action Preconditions with Programs (Poster & Oral)
One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we decompose the problem of learning an agent policy for sequential decision making tasks into the sub-problems of precondition inference and action prediction. We show that these sub-problems can be formulated as code-completion problems and exploit pre-trained code understanding models to tackle them. We demonstrate that the proposed code representation coupled with our novel precondition-aware action prediction strategy outperforms prior policy learning approaches in a few-shot learning setting across task-oriented dialog and embodied textworld benchmarks.
Lajanugen Logeswaran · Sungryull Sohn · Yiwei Lyu · Anthony Liu · Dong-Ki Kim · Dongsub Shim · Moontae Lee · Honglak Lee
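Representing preconditions as code makes them executable: an agent can filter its action set by evaluating precondition programs against the current state before predicting an action. A small illustrative sketch of that two-stage decomposition (hypothetical textworld-style domain, not the paper's benchmark):

```python
# Sketch: preconditions as executable programs. Each action carries a
# predicate over the symbolic state; action prediction only chooses
# among actions whose preconditions hold.

PRECONDITIONS = {
    "open_door": lambda s: "key" in s["inventory"] and not s["door_open"],
    "walk_through_door": lambda s: s["door_open"],
    "pick_up_key": lambda s: "key" not in s["inventory"],
}

def plausible_actions(state):
    """Precondition inference: run every precondition program."""
    return [a for a, pre in PRECONDITIONS.items() if pre(state)]

def predict_action(state, preference):
    """Precondition-aware prediction: rank only the plausible actions.
    `preference` stands in for a learned policy or code-completion model,
    most-preferred first."""
    candidates = plausible_actions(state)
    return min(candidates, key=preference.index)

state = {"inventory": [], "door_open": False}
preference = ["walk_through_door", "open_door", "pick_up_key"]
first = predict_action(state, preference)
```

Even though the model prefers `walk_through_door`, the precondition programs rule it out, so the agent is steered to a plausible action instead.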
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View (Poster & Oral)
Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
Raphael Schumann · Wanrong Zhu · Weixi Feng · Tsu-Jui Fu · Stefan Riezler · William Yang Wang
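The agent's interface to the environment is a text prompt: landmarks extracted from the instruction are checked for visibility (with CLIP in the paper) and the trajectory so far is verbalized into the context for the next action. A simplified sketch of the prompt construction, with visibility stubbed out and all names hypothetical:

```python
def verbalize(instruction, trajectory, landmark_visibility):
    """Render the navigation state as text for an LLM to continue.
    `landmark_visibility` maps landmark name -> bool; the paper derives
    this from a CLIP check on the current panorama, here it is given."""
    lines = [f"Instruction: {instruction}"]
    for step, action in enumerate(trajectory, 1):
        lines.append(f"Step {step}: {action}")
    for name, visible in landmark_visibility.items():
        if visible:
            lines.append(f"You see {name}.")
    # The LLM completes this line with the next action.
    lines.append(f"Step {len(trajectory) + 1}:")
    return "\n".join(lines)

prompt = verbalize(
    "Turn left at the bakery and stop at the park.",
    trajectory=["forward", "forward"],
    landmark_visibility={"the bakery": True, "the park": False},
)
```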
Chain of Code: Reasoning with a Language Model-Augmented Code Interpreter (Poster & Oral)
Given that language models (LMs) perform well on reasoning tasks and coding tasks, we hypothesize that code-writing also allows LMs to solve non-coding tasks by applying similar logical and algorithmic reasoning. However, not all such reasoning can be grounded in executable code. For example, while LMs can directly write Python-executable code to solve math problems, for problems like inferring how many countries have land-locked capital cities, the LM may write "code" with functions such as "is_city_landlocked()", which cannot be directly executed. Our intuition is that an LM can not only generate code to solve many problems, but when necessary it can "think in code" (or even "pseudocode") to arrive at more effective problem-solving strategies, generating executable code at times and simulating the execution of any ill-defined statements when useful. In this work, we propose Chain of Code (CoC), a method that improves LMs' reasoning capabilities by first prompting LMs to write code to solve a task and then, during code execution, using LMs to simulate the execution of code lines that could not be executed. We demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a significant gain of 12% over Chain of Thought.
Chengshu Li · Jacky Liang · Fei Xia · Andy Zeng · Sergey Levine · Dorsa Sadigh · Karol Hausman · Xinyun Chen · Fei-Fei Li · brian ichter
-
|
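The control flow of a Chain-of-Code-style interpreter can be sketched in a few lines. This is an assumption-laden toy, not the paper's implementation: each line of LM-written code is run by the real interpreter when possible, and otherwise handed to an "LMulator", here a stub lookup that stands in for an LM simulating the semantics of an undefined function such as is_city_landlocked().

```python
# Toy sketch of the Chain-of-Code execution loop (not the paper's code).

def lm_simulate(line: str, state: dict) -> None:
    # Hypothetical stand-in for querying an LM to "execute" a line that
    # Python cannot run, e.g. a call to an undefined semantic function.
    if "is_city_landlocked" in line:
        state["landlocked"] = True  # pretend the LM answered the question

def chain_of_code(program: list[str]) -> dict:
    """Run each line with the real interpreter first; on failure,
    fall back to LM-simulated execution over the shared state."""
    state: dict = {}
    for line in program:
        try:
            exec(line, {}, state)      # executable lines run for real
        except Exception:
            lm_simulate(line, state)   # ill-defined lines are simulated
    return state

final_state = chain_of_code([
    "count = 0",
    "landlocked = is_city_landlocked('La Paz')",  # not directly executable
    "count = count + int(landlocked)",
])
```

The key point the sketch captures is that executable and simulated lines interleave over one program state, so exact computation and LM "thinking in code" compose within a single run.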
Skill Reinforcement Learning and Planning for Open-World Long-Horizon Tasks (Poster & Oral)
We study building an agent that solves diverse long-horizon tasks in open-world environments. Without human demonstrations, learning to accomplish tasks in a large open-world environment with reinforcement learning (RL) is extremely inefficient. To tackle this challenge, we convert the multi-task learning problem into learning basic skills and planning over the skills, and propose a Finding-skill to improve the sample efficiency of training all the skills. Using the popular open-world game Minecraft as the testbed, we propose three types of fine-grained basic skills, and use RL with intrinsic rewards to acquire skills with high success rates. For skill planning, we leverage the prior knowledge in Large Language Models to find the relationships between skills and build a skill graph. When the agent is solving a task, our skill search algorithm walks on the skill graph and generates the proper skill plans for the agent. In experiments, our method accomplishes 40 diverse Minecraft tasks, where many tasks require sequentially executing more than 10 skills. Our method outperforms baselines by a large margin and is the most sample-efficient demonstration-free RL method to solve Minecraft Tech Tree tasks.
Haoqi Yuan · Chi Zhang · Hongcheng Wang · Feiyang Xie · Penglin Cai · Hao Dong · Zongqing Lu
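The skill-graph search the abstract describes can be illustrated with a small dependency walk. This is our own toy (the skill names and graph shape are hypothetical, and a real system would build the graph with LLM-derived skill relationships): edges point from a skill to its prerequisites, and a depth-first traversal emits an execution order that satisfies every dependency.

```python
# Toy sketch of planning over a skill graph (illustrative only).

SKILL_GRAPH = {            # hypothetical Minecraft-style dependencies
    "craft_wooden_pickaxe": ["craft_planks", "find_crafting_table"],
    "craft_planks": ["chop_tree"],
    "find_crafting_table": ["craft_planks"],
    "chop_tree": [],
}

def plan_skills(target: str, graph: dict) -> list[str]:
    """Depth-first walk: append a skill only after all prerequisites,
    skipping skills already placed in the plan."""
    plan: list[str] = []
    def visit(skill: str) -> None:
        if skill in plan:
            return
        for prereq in graph.get(skill, []):
            visit(prereq)
        plan.append(skill)
    visit(target)
    return plan

plan = plan_skills("craft_wooden_pickaxe", SKILL_GRAPH)
```

The plan lists shared prerequisites once, which is what makes long skill chains tractable compared with flat RL over the full task.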
Language Agents as Digital Representatives in Collective Decision-Making (Poster & Oral)
Consider the process of collective decision-making, in which a group of individuals interactively select a preferred outcome from among a universe of alternatives. In this context, "representation" is the activity of making an individual's preferences present in the process via participation by a proxy agent, i.e. their "representative". To this end, learned models of human behavior have the potential to fill this role, with practical implications for multi-agent scenario studies and mechanism design. In this work, we investigate the possibility of training language agents to behave in the capacity of representatives of human agents, appropriately expressing the preferences of those individuals whom they stand for. First, we formalize the setting of collective decision-making as the episodic process of interaction between a group of agents and a decision mechanism. On this basis, we then formalize the problem of digital representation as the simulation of an agent's behavior to yield equivalent outcomes from the mechanism. Finally, we conduct an empirical case study in the setting of consensus-finding among diverse humans, and demonstrate the feasibility of fine-tuning large language models to act as digital representatives.
Daniel Jarrett · Miruna Pislar · Michael Tessler · Michiel Bakker · Raphael Koster · Jan Balaguer · Romuald Elie · Christopher Summerfield · Andrea Tacchetti
Selective Perception: Learning Concise State Descriptions for Language Model Actors (Poster & Oral)
It is increasingly common for large language models (LLMs) to be applied as actors in sequential decision making problems in embodied domains such as robotics and games, due to their general world knowledge and planning abilities. However, LLMs are not natively trained for embodied decision making problems, and expressing complex state spaces in text is non-trivial. Exhaustively describing high-dimensional states leads to prohibitive inference costs and impaired task performance due to distracting or irrelevant information. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features of a state to communicate and which to leave out. In this work, we propose BLINDER (Brief Language INputs for DEcision-making Responses), a method for learning to select concise and helpful sets of state features for LLM actors. BLINDER learns a value function for task-conditioned state descriptions that approximates the likelihood that a state description will result in optimal actions. We evaluate BLINDER on the challenging video game NetHack and a real-world robotic manipulation task. Our method improves task success rate by 77% and 14% on NetHack and robotic manipulation respectively, reduces model input length by 83%, and generalizes well to LLM actors of various sizes and quality.
Kolby T Nottingham · Yasaman Razeghi · Kyungmin Kim · JB Lanier · Pierre Baldi · Roy Fox · Sameer Singh
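The value-guided selection idea can be made concrete with a greedy sketch. This is our simplification, not the released BLINDER code: `value_fn` is a hypothetical stand-in for the learned task-conditioned value function, and the selection loop greedily keeps the features it scores highest, up to a length budget.

```python
# Rough sketch of BLINDER-style state-feature selection (illustrative).

def value_fn(task: str, description: tuple[str, ...]) -> float:
    # Stub value function: reward descriptions whose features mention
    # words from the task. A real system would learn this from returns.
    return sum(1.0 for feat in description for word in task.split() if word in feat)

def select_features(task: str, features: list[str], budget: int) -> list[str]:
    """Greedily add the feature that most increases the description's
    value until the budget is reached."""
    chosen: list[str] = []
    while len(chosen) < budget:
        best = max(
            (f for f in features if f not in chosen),
            key=lambda f: value_fn(task, tuple(chosen + [f])),
            default=None,
        )
        if best is None:
            break
        chosen.append(best)
    return chosen

state_features = [
    "a key lies on the floor",
    "the walls are made of stone",
    "a locked door blocks the corridor",
]
concise = select_features("open the door with the key", state_features, budget=2)
```

The task-irrelevant wall description is dropped, shortening the prompt while keeping the features an actor needs.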
LASER: LLM Agent with State-Space Exploration for Web Navigation (Poster & Oral)
Large language models (LLMs) have been successfully adapted for interactive decision-making tasks like web navigation. While achieving decent performance, previous methods implicitly assume a forward-only execution mode for the model, where they only provide oracle trajectories as in-context examples to teach the model how to reason in the interactive environment. Consequently, the model could not handle more challenging scenarios not covered in the in-context examples, e.g., mistakes, leading to sub-optimal performance. To address this issue, we propose to model the interactive task as state space exploration, where the LLM agent transitions among a pre-defined set of states by performing actions to complete the task. This formulation enables flexible backtracking, allowing the model to easily recover from errors. We evaluate our proposed LLM Agent with State-Space ExploRation (LASER) on the WebShop task. Experimental results show that our LASER agent significantly outperforms previous methods and closes the gap with human performance on the web navigation task.
Kaixin Ma · Hongming Zhang · Hongwei Wang · Xiaoman Pan · Dong Yu 
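The state-space formulation can be pictured as a small state machine. This is our reading of the idea, not the authors' code: the transition table, state names, and actions are hypothetical, and the point is that a backtracking edge lets the agent leave a wrong page instead of only rolling forward.

```python
# Schematic of a LASER-style pre-defined state space (illustrative).

TRANSITIONS = {   # hypothetical web-shop state machine
    ("search", "submit_query"): "results",
    ("results", "click_item"): "item",
    ("item", "back_to_results"): "results",   # backtracking edge
    ("item", "buy"): "done",
}

def run_agent(policy: list[str], start: str = "search") -> list[str]:
    """Apply a sequence of actions; actions undefined in the current
    state are ignored (a real agent would re-prompt the LLM instead)."""
    state, trace = start, [start]
    for action in policy:
        nxt = TRANSITIONS.get((state, action))
        if nxt is not None:
            state = nxt
            trace.append(state)
    return trace

# The agent inspects an item, decides it is wrong, backtracks, and retries.
trace = run_agent(["submit_query", "click_item", "back_to_results",
                   "click_item", "buy"])
```

Because every state has its own action set, an in-context example for one state transfers to any trajectory that passes through it, which is what forward-only exemplar prompting lacks.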
Free from Bellman Completeness: Trajectory Stitching via Model-based Return-conditioned Supervised Learning (Poster & Oral)
Off-policy dynamic programming (DP) techniques that implement fixed-point iteration, such as Q-learning, have proven important for solving sequential decision-making problems. However, in the presence of function approximation such algorithms are not guaranteed to converge, often diverging due to the absence of Bellman completeness in the function classes considered, a crucial condition for the success of DP-based methods. In this paper, we show how off-policy learning techniques based on return-conditioned supervised learning (RCSL) are able to circumvent these challenges of Bellman completeness, converging under significantly more relaxed assumptions inherited from supervised learning. We prove there exists a natural environment in which, if one uses a two-layer multilayer perceptron as the function approximator, the layer width needs to grow linearly with the state space size to satisfy Bellman completeness, while a constant layer width is enough for RCSL. These findings take a step towards explaining the superior empirical performance of RCSL methods compared to DP-based methods in many near-deterministic environments in deep reinforcement learning. Furthermore, in order to learn from sub-optimal datasets, we propose a simple framework called MBRCSL, granting RCSL methods the ability of dynamic programming to stitch together segments from distinct trajectories. MBRCSL leverages learned dynamics models and forward sampling to accomplish trajectory stitching while avoiding the need for Bellman completeness that plagues all dynamic programming algorithms. We provide both theoretical analysis and experimental evaluation to support these claims, outperforming state-of-the-art model-free and model-based offline RL algorithms across several simulated robotics problems.
Zhaoyi Zhou · Chuning Zhu · Runlong Zhou · Qiwen Cui · Abhishek Gupta · Simon Du
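The return-conditioning idea at the heart of RCSL can be shown in tabular form. This is a toy of our own construction, not the MBRCSL system: the "policy" simply memorizes which action followed each (state, return-to-go) pair in the dataset, and acting means conditioning on a desired return.

```python
# Tabular sketch of return-conditioned supervised learning (illustrative).
from collections import Counter, defaultdict

def fit_rcsl(trajectories):
    """Count action frequencies conditioned on (state, return-to-go).
    Each trajectory is a list of (state, action, reward) steps."""
    table = defaultdict(Counter)
    for traj in trajectories:
        rewards = [r for (_, _, r) in traj]
        for t, (state, action, _) in enumerate(traj):
            rtg = sum(rewards[t:])            # return-to-go at step t
            table[(state, rtg)][action] += 1
    return table

def act(table, state, target_return):
    """Pick the most frequent action seen under the target return."""
    counts = table.get((state, target_return))
    return counts.most_common(1)[0][0] if counts else None

# Two trajectories from the same start state with different outcomes.
dataset = [
    [("s0", "left", 0), ("s1", "stay", 1)],   # total return 1
    [("s0", "right", 0), ("s2", "stay", 0)],  # total return 0
]
policy = fit_rcsl(dataset)
best_action = act(policy, "s0", target_return=1)  # condition on high return
```

Conditioning on the high return recovers the good action with plain supervised counting and no Bellman backup; MBRCSL's contribution is adding a learned dynamics model on top so that segments of distinct trajectories can be stitched.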
Towards End-to-End Embodied Decision Making with Multi-modal Large Language Model (Poster & Oral)
In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents. While Large Language Models (LLMs) have been widely used due to their advanced reasoning skills and vast world knowledge, MLLMs like GPT4-Vision offer enhanced visual understanding and reasoning capabilities. We investigate whether state-of-the-art MLLMs can handle embodied decision-making in an end-to-end manner and whether collaborations between LLMs and MLLMs can enhance decision-making. To address these questions, we introduce a new benchmark called PCA-EVAL, which evaluates embodied decision-making from the perspectives of Perception, Cognition, and Action. Additionally, we propose HOLMES, a multi-agent cooperation framework that allows LLMs to leverage MLLMs and APIs to gather multimodal information for informed decision-making. We compare end-to-end embodied decision-making and HOLMES on our benchmark and find that the GPT4-Vision model demonstrates strong end-to-end embodied decision-making abilities, outperforming GPT4-HOLMES in terms of average decision accuracy (+3%). However, this performance is exclusive to the latest GPT4-Vision model, which surpasses the open-source state-of-the-art MLLM by 26%. Our results indicate that powerful MLLMs like GPT4-Vision hold promise for decision-making in embodied agents, offering new avenues for MLLM research. The PCA-EVAL benchmark and evaluation guidelines will be open-sourced.
Liang Chen · Yichi Zhang · Shuhuai Ren · Haozhe Zhao · Zefan Cai · Yuchi Wang · Tianyu Liu · Baobao Chang
Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations (Poster & Oral)
Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks. However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome. For example, a teacher might try to understand their student's current comprehension level to tailor their instruction accordingly, and a travel agent might ask questions of their customer to understand their preferences in order to recommend activities they might enjoy. LLMs trained with supervised fine-tuning or "single-step" RL, as with standard RLHF, may struggle with tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction. In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue. Our key insight is that, though LLMs might not effectively solve goal-directed dialogue tasks out of the box, they can provide useful data for solving such tasks by simulating suboptimal but human-like behaviors. Given a textual description of a goal-directed dialogue task, we leverage LLMs to sample diverse synthetic rollouts of hypothetical in-domain human-human interactions. Our algorithm then utilizes this dataset with offline reinforcement learning to train an interactive conversational agent that can optimize goal-directed objectives over multiple turns. In effect, the LLM produces examples of possible interactions, and RL then processes these examples to learn to perform more optimal interactions. Empirically, we show that our proposed approach achieves state-of-the-art performance in various goal-directed dialogue tasks that include teaching and preference elicitation.
Joey Hong · Sergey Levine · Anca Dragan
PASTA: Pretrained Action-State Transformer Agents (Poster & Oral)
Self-supervised learning has brought about a revolutionary paradigm shift in various computing domains, including NLP, vision, and biology. Recent approaches involve pre-training transformer models on vast amounts of unlabeled data, serving as a starting point for efficiently solving downstream tasks. In the realm of reinforcement learning, researchers have recently adapted these approaches by developing models pre-trained on expert trajectories, enabling them to address a wide range of tasks, from robotics to recommendation systems. However, existing methods mostly rely on intricate pre-training objectives tailored to specific downstream applications. This paper presents a comprehensive investigation of models we refer to as Pre-trained Action-State Transformer Agents (PASTA). Our study uses a unified methodology and covers an extensive set of general downstream tasks including behavioral cloning, offline RL, sensor failure robustness, and dynamics change adaptation. Our goal is to systematically compare various design choices and provide valuable insights to practitioners for building robust models. Key highlights of our study include tokenization at the action and state component level, using fundamental pre-training objectives like next token prediction, training models across diverse domains simultaneously, and using parameter-efficient fine-tuning (PEFT). The developed models in our study contain fewer than 10 million parameters, and the application of PEFT enables fine-tuning of fewer than 10,000 parameters during downstream adaptation, allowing a broad community to use these models and reproduce our experiments. We hope that this study will encourage further research into the use of transformers with first-principles design choices to represent RL trajectories and contribute to robust policy learning.
Raphael Boige · Yannis Flet-Berliac · Arthur Flajolet · Guillaume Richard · Thomas Pierrot
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
(
Poster
)
>
link
Building agents using large language models (LLMs) to control computers is an emerging research field. In this setting, the agent processes computer states and performs actions to accomplish tasks specified in natural language. Previous computer agents have demonstrated the benefits of in-context learning (ICL), i.e., prompting LLMs with a few exemplars; however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in tasks that require many steps or repeated actions. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, a computer agent that incorporates trajectory-as-exemplar prompting and exemplar memory. Specifically, Synapse has three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions for improved multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. 
Synapse also exhibits a 53% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web. |
Longtao Zheng · Rundong Wang · Xinrun Wang · Bo An 🔗 |
-
|
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
(
Oral
)
>
link
Building agents using large language models (LLMs) to control computers is an emerging research field. In this setting, the agent processes computer states and performs actions to accomplish tasks specified in natural language. Previous computer agents have demonstrated the benefits of in-context learning (ICL), i.e., prompting LLMs with a few exemplars; however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in tasks that require many steps or repeated actions. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, a computer agent that incorporates trajectory-as-exemplar prompting and exemplar memory. Specifically, Synapse has three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions for improved multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a 53% relative improvement in average step success rate over the previous state-of-the-art prompting scheme on Mind2Web. |
Longtao Zheng · Rundong Wang · Xinrun Wang · Bo An 🔗 |
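The exemplar-memory component described above (embed past trajectories, retrieve the most similar ones for a new task) can be sketched as follows. This is a minimal illustration: the bag-of-characters `embed` function is a hypothetical stand-in for a real sentence-embedding model, and the task/trajectory strings are invented.

```python
import numpy as np

def embed(text):
    # Toy bag-of-characters embedding standing in for a real
    # sentence-embedding model (hypothetical placeholder).
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord('a')] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class ExemplarMemory:
    """Stores (task, trajectory) pairs and retrieves the most similar
    exemplars for a new task via cosine similarity over embeddings."""
    def __init__(self):
        self.keys, self.trajectories = [], []

    def add(self, task, trajectory):
        self.keys.append(embed(task))
        self.trajectories.append(trajectory)

    def retrieve(self, task, k=1):
        q = embed(task)
        scores = [float(q @ key) for key in self.keys]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.trajectories[i] for i in top]

memory = ExemplarMemory()
memory.add("book a flight", ["click origin", "click destination", "click search"])
memory.add("enter text in a box", ["click box", "type text"])
best = memory.retrieve("book a cheap flight", k=1)[0]
```

A real system would store trajectories of abstracted states and actions and prepend the retrieved ones to the LLM prompt; only the retrieval pattern is shown here.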
-
|
Building Cooperative Embodied Agents Modularly with Large Language Models
(
Poster
)
>
link
In this work, we address challenging multi-agent cooperation problems with decentralized control, raw sensory observations, costly communication, and multi-objective tasks instantiated in various embodied environments. While previous research either presupposes a cost-free communication channel or relies on a centralized controller with shared observations, we harness the commonsense knowledge, reasoning ability, language comprehension, and text generation prowess of LLMs and seamlessly incorporate them into a cognitive-inspired modular framework that integrates perception, memory, and execution, thus building a Cooperative Embodied Language Agent, CoELA, that can plan, communicate, and cooperate with others to accomplish long-horizon tasks efficiently. Our experiments on C-WAH and TDW-MAT demonstrate that CoELA driven by GPT-4 can surpass strong planning-based methods and exhibit emergent effective communication. Though current open LMs like LLaMA-2 still underperform, we fine-tune CoLLAMA on data collected with our agents and show that it can achieve promising performance. We also conducted a user study on human-agent interaction and found that CoELA communicating in natural language can earn more trust and cooperate more effectively with humans. Our research underscores the potential of LLMs for future research in multi-agent cooperation. Videos can be found on the project website https://llm-co.github.io/CoELA/ . |
Hongxin Zhang · Weihua Du · Jiaming Shan · Qinhong Zhou · Yilun Du · Josh Tenenbaum · Tianmin Shu · Chuang Gan 🔗 |
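The modular framework the abstract outlines (perception filters raw observations, memory accumulates them, an LLM-based planner proposes actions, execution carries them out) can be sketched schematically. Everything below is illustrative: the module internals and the stub planner are hypothetical, with the LLM replaced by a deterministic function.

```python
class CoopAgent:
    """Minimal sketch of a cognitive-inspired modular agent: perception
    filters raw observations, memory accumulates them, and a (stubbed)
    language-model planner picks the next action. Module names follow
    the abstract; the internals are hypothetical."""
    def __init__(self, propose_action):
        self.memory = []                      # episodic memory of observations
        self.propose_action = propose_action  # stands in for an LLM planner

    def perceive(self, raw_obs):
        # Perception: keep only task-relevant fields.
        return {k: v for k, v in raw_obs.items() if k in ("objects", "messages")}

    def step(self, raw_obs):
        obs = self.perceive(raw_obs)
        self.memory.append(obs)
        return self.propose_action(self.memory)

# Deterministic stand-in planner: grab the first object ever seen.
def stub_planner(memory):
    for obs in memory:
        if obs["objects"]:
            return ("pickup", obs["objects"][0])
    return ("explore", None)

agent = CoopAgent(stub_planner)
action = agent.step({"objects": ["cup"], "messages": [], "noise": 42})
```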
-
|
Building Cooperative Embodied Agents Modularly with Large Language Models
(
Oral
)
>
link
-
|
FoMo rewards: Casting foundation models as generic reward functions
(
Poster
)
>
link
We explore the viability of casting foundation models as generic reward functions for reinforcement learning. To this end, we propose a simple pipeline that interfaces an off-the-shelf vision model with a large language model. Given a trajectory of observations, we infer the likelihood of an instruction describing the task that the user wants an agent to perform. We show that this generic likelihood function exhibits the characteristics ideally expected from a reward function: it associates high values with the desired behaviour and lower values with several similar but incorrect policies. Overall, our work opens the possibility of designing open-ended agents for interactive tasks via foundation models. |
Ekdeep S Lubana · Pim de Haan · Taco Cohen · Johann Brehmer 🔗 |
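The likelihood-as-reward recipe can be sketched as a pipeline: a vision model describes each observation, and a language model scores the instruction against those descriptions. Both models are stubbed below with deterministic placeholders (a dictionary lookup and a word-overlap score), so only the composition pattern follows the abstract.

```python
import math

def caption(frame):
    # Hypothetical stand-in for an off-the-shelf vision model that
    # describes an observation in text.
    return frame["description"]

def instruction_logprob(instruction, context):
    # Hypothetical stand-in for an LLM scoring log p(instruction | context).
    # Here: a simple word-overlap surrogate so the sketch is runnable.
    inst, ctx = set(instruction.split()), set(" ".join(context).split())
    overlap = len(inst & ctx) / max(len(inst), 1)
    return math.log(max(overlap, 1e-6))

def fomo_reward(trajectory, instruction):
    """Reward = likelihood of the instruction given (captions of) the
    observed trajectory."""
    captions = [caption(f) for f in trajectory]
    return instruction_logprob(instruction, captions)

good = [{"description": "robot picks up the red block"}]
bad = [{"description": "robot wanders near the table"}]
r_good = fomo_reward(good, "pick up the red block")
r_bad = fomo_reward(bad, "pick up the red block")
```

The desired property from the abstract is exactly `r_good > r_bad`: trajectories matching the instruction score higher than similar but incorrect ones.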
-
|
FoMo rewards: Casting foundation models as generic reward functions
(
Oral
)
>
link
-
|
A Universal World Model Learned from Large Scale and Diverse Videos
(
Poster
)
>
link
World models play a crucial role in model-based reinforcement learning (RL) by providing predictive representations of an agent in an environment, enabling the agent to reason about the future and make more informed decisions. However, two main problems still limit the application of world models. First, current methods typically train world models using only massive domain-specific data, making it challenging to generalize to unseen scenarios or adapt to changes in the environment. Second, it is difficult to define actions when world models are trained on in-the-wild videos. In this work, we tackle both problems by learning a general-purpose world model from a diverse, large-scale real-world video dataset with extracted latent actions. Specifically, our approach leverages a pre-trained vision encoder to project the images of two adjacent frames into states; it then extracts latent actions into a low-dimensional space via vector quantization; finally, a dynamics function is learned over the latent actions. Results show that the proposed generic world model can successfully extract latent actions for arbitrary neighboring frames when tested on an in-the-wild video dataset. Furthermore, fine-tuning on only a small amount of in-domain data significantly improves the accuracy of the generic world model when adapting to unseen environments. |
Hanchen Cui · Yang Gao 🔗 |
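The latent-action extraction step (embed two adjacent frames, quantize their difference against a codebook) can be sketched as below. The identity "encoder" and the two-entry codebook are hypothetical toys; a real system would use a frozen vision encoder and a learned codebook.

```python
import numpy as np

def encode(frame):
    # Hypothetical frozen vision encoder: here just a flat float vector.
    return np.asarray(frame, dtype=float)

def quantize_latent_action(prev_frame, next_frame, codebook):
    """Latent-action extraction sketch: the change between adjacent frame
    embeddings is snapped to its nearest codebook vector (vector
    quantization), yielding a discrete 'action' without action labels."""
    delta = encode(next_frame) - encode(prev_frame)
    dists = np.linalg.norm(codebook - delta, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# Toy 2-D codebook: 'move right' and 'move up' (assumed, for illustration).
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
idx, latent = quantize_latent_action([0.0, 0.0], [0.9, 0.1], codebook)
```

The dynamics function would then be trained to predict the next state from the current state and the quantized latent action.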
-
|
A Universal World Model Learned from Large Scale and Diverse Videos
(
Oral
)
>
link
-
|
From Text to Tactic: Evaluating LLMs Playing the Game of Avalon
(
Poster
)
>
link
In this paper, we explore the potential of Large Language Model (LLM) agents in playing the strategic social deduction game Resistance Avalon. Players in Avalon are challenged not only to make informed decisions based on dynamically evolving game phases, but also to engage in discussions where they must deceive, deduce, and negotiate with other players. These characteristics make Avalon a compelling test-bed for studying the decision-making and language-processing capabilities of LLM agents. To facilitate research in this line, we introduce AvalonBench, a comprehensive game environment tailored for evaluating multi-agent LLM agents. This benchmark incorporates: (1) a game environment for Avalon, (2) rule-based bots as baseline opponents, and (3) ReAct-style LLM agents with tailored prompts for each role. Notably, our evaluations based on AvalonBench highlight a clear capability gap: for instance, models like ChatGPT playing the good role achieve a 22.2% win rate against rule-based bots playing evil, while good-role bots achieve a 38.2% win rate in the same setting. We envision AvalonBench as a test-bed for developing more advanced LLMs (with self-play) and agent frameworks that can effectively model the layered complexities of such game environments. |
Jonathan Light · Min Cai · Sheng Shen · Ziniu Hu 🔗 |
-
|
From Text to Tactic: Evaluating LLMs Playing the Game of Avalon
(
Oral
)
>
link
-
|
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
(
Poster
)
>
link
Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations from $1500$ steps earlier. However, Transformers do not improve long-term credit assignment. In summary, our results provide an explanation for the success of Transformers in RL, while also highlighting an important area for future research and benchmark design.
|
Tianwei Ni · Michel Ma · Benjamin Eysenbach · Pierre-Luc Bacon 🔗 |
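A configurable task that decouples the two quantities can be sketched as follows. This toy environment is hypothetical and only in the spirit of the paper's tasks: a cue shown at t=0 must be repeated `memory_len` steps later (memory), and the reward is only revealed at episode end (delayed credit assignment).

```python
import random

def run_episode(policy, memory_len, horizon):
    """Toy task decoupling memory from credit assignment: a cue shown at
    t=0 must be repeated at t=memory_len; the resulting reward is only
    returned when the episode ends. Hypothetical, illustrative only."""
    rng = random.Random(0)
    cue = rng.choice([0, 1])
    obs_history, reward = [], 0.0
    for t in range(horizon):
        obs = cue if t == 0 else -1      # cue is visible only at t=0
        obs_history.append(obs)
        action = policy(obs_history)
        if t == memory_len and action == cue:
            reward = 1.0                 # recall step decides the outcome
    return reward                        # ...but it is paid at the end

# A recurrent-style policy with perfect memory of the first observation.
def remember_first(history):
    return history[0]

score = run_episode(remember_first, memory_len=5, horizon=10)
```

Varying `memory_len` (cue-to-recall gap) and `horizon` (recall-to-reward gap) independently is what lets such tasks attribute an architecture's gains to memory versus credit assignment.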
-
|
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
(
Oral
)
>
link
-
|
Self-Select: Optimizing Instruction Selection for Large Language Models
(
Poster
)
>
link
The same question can often be presented in different ways, depending on the audience and the intent with which it is being posed. To determine whether large language models (LLMs) demonstrate preferences for one phrasing over another regardless of semantic content, we introduce _Self-Select_, a method for selection of a preferred instruction template, and generation of high-quality synthetic data samples. This algorithm makes use of a _meta-prompt_ to decide on an instruction template, given the task and a list of candidate templates; then, the same LLM generates $n$ new samples following the structure of the chosen template. We evaluate _Self-Select_ on numerical reasoning and sentiment classification tasks, using a variety of both instruction-tuned and base models, providing insights into their ability to perform instruction selection, and their respective preferences and biases as a function of the instruction fine-tuning data. Our results demonstrate that stronger models can successfully perform instruction selection, and that large instruction-tuned models are more consistent and stable in their template choices. Furthermore, we find that permuting the instruction template ordering in the prompt leads to vastly different choice distributions, suggesting that decisions may be influenced more by inductive biases than by semantic understanding, even after instruction-tuning.
|
Keshav Ramji · Alexander Kyimpopkin 🔗 |
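The meta-prompt selection step can be sketched as below. The prompt wording and the stub LLM (which deterministically prefers the shortest template, standing in for a real model's possibly biased preference) are illustrative assumptions, not the paper's exact prompt or models.

```python
def meta_prompt(task, templates):
    """Build a meta-prompt asking an LLM to pick one instruction template.
    The wording here is an illustrative assumption."""
    lines = [f"Task: {task}", "Choose the best instruction template:"]
    lines += [f"({i}) {t}" for i, t in enumerate(templates)]
    return "\n".join(lines)

def self_select(task, templates, llm):
    prompt = meta_prompt(task, templates)
    choice = int(llm(prompt))            # the LLM answers with an index
    return templates[choice]

# Deterministic stub LLM: always prefers the shortest option.
def stub_llm(prompt):
    options = [l for l in prompt.splitlines() if l.startswith("(")]
    lengths = [len(o) for o in options]
    return str(lengths.index(min(lengths)))

templates = ["Answer the question: {q}",
             "Q: {q} A:",
             "Please provide a detailed answer to: {q}"]
chosen = self_select("sentiment of 'great movie'", templates, stub_llm)
```

Permuting the `templates` list before building the meta-prompt, as the abstract describes, is how one can probe whether the choice tracks semantics or ordering biases.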
-
|
Self-Select: Optimizing Instruction Selection for Large Language Models
(
Oral
)
>
link
-
|
Benchmarking Large Language Models as AI Research Agents
(
Poster
)
>
link
Scientific experimentation involves an iterative process of creating hypotheses, designing experiments, running experiments, and analyzing the results. Can we build AI research agents to perform these long-horizon tasks? To take a step towards building and evaluating research agents on such open-ended decision-making tasks, we focus on the problem of machine learning engineering: given a task description and a dataset, build a high-performing model. In this paper, we propose MLAgentBench, a suite of ML tasks for benchmarking AI research agents. Agents can perform actions like reading/writing files, executing code, and inspecting outputs. With these actions, agents can run experiments, analyze the results, and modify the code of entire machine learning pipelines, such as data processing, architecture, training processes, etc. The benchmark then automatically evaluates the agent's performance objectively over various metrics related to performance and efficiency. We also design an LLM-based research agent to automatically perform experimentation loops in such an environment. Empirically, we find that a GPT-4-based research agent can feasibly build compelling ML models over many tasks in MLAgentBench, displaying highly interpretable plans and actions. However, the success rates vary considerably; they span from almost 90% on well-established older datasets to as low as 10% on recent Kaggle Challenges (unavailable during the LLM's pretraining) and even 0% on newer research challenges like BabyLM. Finally, we identify several key challenges for LLM-based research agents such as long-term planning and hallucination. Our code is released at https://anonymous.4open.science/r/MLAgentBench/. |
Qian Huang · Jian Vora · Percy Liang · Jure Leskovec 🔗 |
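The read/write/execute action interface the benchmark exposes can be sketched as a tiny dispatcher over a workspace. The action names and workspace-as-dict design are illustrative assumptions, not the benchmark's exact API, and `exec` here runs only trusted toy snippets.

```python
def run_action(action, workspace):
    """Sketch of a research-agent action interface: the agent reads and
    writes files and executes code in a workspace dict. Action names
    are illustrative, not MLAgentBench's exact API."""
    kind, arg = action
    if kind == "read":
        return workspace.get(arg, "")
    if kind == "write":
        name, content = arg
        workspace[name] = content
        return "ok"
    if kind == "execute":
        # Run a (trusted, toy) snippet and capture its 'result' variable.
        local = {}
        exec(workspace[arg], {}, local)
        return local.get("result")
    raise ValueError(f"unknown action: {kind}")

ws = {}
run_action(("write", ("train.py", "result = 2 + 2")), ws)
out = run_action(("execute", "train.py"), ws)
```

An agent loop would alternate between proposing such actions with an LLM and feeding the returned observations back into its context.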
-
|
Benchmarking Large Language Models as AI Research Agents
(
Oral
)
>
link
-
|
The Unsolved Challenges of LLMs in Open-Ended Web Tasks: A Case Study
(
Poster
)
>
link
In this work, we investigate the challenges associated with developing goal-driven AI agents capable of performing open-ended tasks in a web environment using zero-shot learning. Our primary focus is on harnessing the capabilities of large language models (LLMs) in the context of web navigation through HTML-based user interfaces (UIs). We evaluate the MiniWoB benchmark and show that it is a suitable yet challenging platform for assessing an agent's ability to comprehend and solve tasks without prior human demonstrations. Our main contribution encompasses a set of extensive experiments where we compare and contrast various agent design considerations, such as action space, observation space, and the choice of LLM, with the aim of shedding light on the bottlenecks and limitations of LLM-based zero-shot learning in this domain, in order to foster research endeavours in this area. In our empirical analysis, we find that: (1) the code-based action space is notably the most effective; (2) open-source LLMs hold their own as competitive agents for open-ended web tasks when compared to their proprietary counterparts; and (3) using an accessibility-based representation for web pages, despite resulting in some performance loss, emerges as a cost-effective strategy, particularly as web page sizes increase. |
Rim Assouel · Tom Marty · Massimo Caccia · Issam Hadj Laradji · Alexandre Drouin · Sai Rajeswar Mudumba · Hector Palacios · Quentin Cappart · David Vazquez · Nicolas Chapados · Maxime Gasse · Alexandre Lacoste
|
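The accessibility-based page representation mentioned in finding (3) above can be sketched as a traversal that keeps only interactive role/name pairs, shrinking the observation fed to the LLM. The DOM-as-nested-dict format and the retained roles are illustrative assumptions.

```python
def accessibility_repr(dom):
    """Sketch of an accessibility-based page representation: keep only
    role/name pairs of interactive elements from a DOM-like tree. The
    tree format here is an illustrative assumption."""
    out = []
    def walk(node, depth=0):
        role, name = node.get("role"), node.get("name", "")
        if role in {"button", "link", "textbox"}:
            out.append("  " * depth + f"{role} '{name}'")
        for child in node.get("children", []):
            walk(child, depth + 1)
    walk(dom)
    return "\n".join(out)

dom = {"role": "page", "children": [
    {"role": "div", "children": [{"role": "button", "name": "Search"}]},
    {"role": "link", "name": "Home"},
]}
obs = accessibility_repr(dom)
```

The trade-off the abstract reports follows directly: this representation discards layout detail (some performance loss) but grows far more slowly than raw HTML as pages get larger.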
-
|
The Unsolved Challenges of LLMs in Open-Ended Web Tasks: A Case Study
(
Oral
)
>
link
-
|
TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
(
Poster
)
>
link
The full potential of large pretrained models remains largely untapped in control domains like robotics. This is mainly because of the scarcity of data and the computational challenges associated with training or fine-tuning these large models for such applications. Prior work mainly emphasizes effective \emph{pretraining} of large models for decision-making, with little exploration into how to perform data-efficient continual \emph{adaptation} of these models for new tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters for Imitation Learning), a framework for efficient adaptation to new control tasks. Inspired by recent advancements in parameter-efficient fine-tuning in language domains, we explore efficient fine-tuning techniques---e.g., Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA)---in TAIL to adapt large pretrained models for new tasks with limited demonstration data. Our extensive experiments comparing prevalent parameter-efficient fine-tuning techniques and adaptation baselines suggest that TAIL with LoRA can achieve the best post-adaptation performance with only 1% of the trainable parameters of full fine-tuning, while avoiding catastrophic forgetting and preserving adaptation plasticity in continual learning settings. |
ZUXIN LIU · Jesse Zhang · Kavosh Asadi · Yao Liu · DING ZHAO · Shoham Sabach · Rasool Fakoor 🔗 |
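The LoRA technique the abstract builds on can be sketched in a few lines: the frozen weight W is augmented with a trainable low-rank product B @ A, and with the standard initialization (B = 0) the adapted model starts exactly at the pretrained one. The dimensions below are arbitrary toys.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Low-Rank Adaptation sketch: y = x @ (W + alpha * B @ A).T, where
    W is frozen and only the rank-r factors A and B are trained."""
    return x @ (W + alpha * (B @ A)).T

d_out, d_in, r = 4, 6, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # B = 0: adapter starts as a no-op
x = rng.standard_normal((1, d_in))

baseline = x @ W.T                       # pretrained model's output
y0 = lora_forward(x, W, A, B)            # identical before any training
```

The parameter saving is also visible here: A and B together hold r*(d_in+d_out) = 20 trainable values versus 24 for full fine-tuning of W; at realistic layer sizes the ratio is what gives the ~1% figure reported above.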
-
|
TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
(
Oral
)
>
link
-
|
MMToM-QA: Multimodal Theory of Mind Question Answering
(
Poster
)
>
link
Theory of Mind (ToM), the cognitive ability to understand people's minds, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets -- either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data, which can include visual cues, linguistic narratives, or both. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models. |
Chuanyang Jin · Yutong Wu · Jing Cao · Jiannan Xiang · Yen-Ling Kuo · Zhiting Hu · Tomer Ullman · Antonio Torralba · Josh Tenenbaum · Tianmin Shu 🔗 |
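The Bayesian inverse planning idea behind BIP-ALM can be sketched as Bayes' rule over goals: P(goal | actions) ∝ P(goal) · Π_t P(action_t | goal). The goal-conditioned policy below is a deterministic stub (in BIP-ALM a language model plays this role), and the rooms/actions are invented.

```python
def goal_posterior(actions, goals, policy, prior=None):
    """Bayesian inverse planning sketch:
    P(goal | actions) ∝ P(goal) * Π_t policy(action_t | goal)."""
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    scores = {}
    for g in goals:
        p = prior[g]
        for a in actions:
            p *= policy(a, g)
        scores[g] = p
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

# Stub goal-conditioned policy: actions toward the goal room are likely.
def stub_policy(action, goal):
    return 0.8 if action == f"walk to {goal}" else 0.2

post = goal_posterior(["walk to kitchen", "walk to kitchen"],
                      ["kitchen", "bedroom"], stub_policy)
```

Observing two goal-consistent actions drives the posterior for "kitchen" above 0.94 here; multimodal inputs would enter through how the observed actions and states are extracted.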
-
|
MMToM-QA: Multimodal Theory of Mind Question Answering
(
Oral
)
>
link
-
|
Compositional Foundation Models for Hierarchical Planning
(
Poster
)
>
link
To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that leverages multiple expert foundation models, trained individually on language, vision, and action data, jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from the generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach on three different long-horizon table-top manipulation tasks. |
Anurag Ajay · Seungwook Han · Yilun Du · Shuang Li · Abhi Gupta · Tommi Jaakkola · Josh Tenenbaum · Leslie Kaelbling · Akash Srivastava · Pulkit Agrawal 🔗 |
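The three-level composition (LLM proposes subgoals, a video model renders each subgoal into frames, an inverse-dynamics model recovers actions between consecutive frames) can be sketched as a pipeline. All three models below are deterministic stubs; only the composition pattern follows the abstract, and the iterative-refinement consistency step is omitted.

```python
def hierarchical_plan(goal, llm_plan, video_model, inverse_dynamics):
    """HiP-style composition sketch: subgoals -> frames -> actions.
    The three callables stand in for the language, video, and
    inverse-dynamics models."""
    actions = []
    for subgoal in llm_plan(goal):
        frames = video_model(subgoal)
        for prev, nxt in zip(frames, frames[1:]):
            actions.append(inverse_dynamics(prev, nxt))
    return actions

# Deterministic stubs for illustration.
plan = lambda g: ["reach block", "grasp block"]
video = lambda sg: [f"{sg}/frame{i}" for i in range(3)]
inv = lambda a, b: f"move({a}->{b})"

acts = hierarchical_plan("stack blocks", plan, video, inv)
```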
-
|
Compositional Foundation Models for Hierarchical Planning
(
Oral
)
>
link
-
|
Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception (Poster) > link
Recently developed multimodal pretrained models can encode rich world knowledge expressed in multiple modalities, such as text and images. However, the outputs of these models cannot be integrated into algorithms to solve sequential decision-making tasks. We develop an algorithm that utilizes the knowledge from the pretrained models to construct and verify controllers for sequential decision-making tasks and to ground these controllers to task environments through visual observations. In particular, the algorithm constructs an automaton-based controller that encodes the task-relevant knowledge extracted from the pretrained model. It then verifies whether the knowledge encoded in the controller is consistent with other independently available knowledge, which may include abstract information on the environment or user-provided specifications. If this verification step discovers any inconsistency, the algorithm automatically refines the controller to resolve the inconsistency. Next, the algorithm leverages the vision and language capabilities of pretrained models to ground the controller to the task environment. We demonstrate the algorithm's ability to construct, verify, and ground automaton-based controllers through a suite of real-world tasks. |
Yunhao Yang · Cyrus Neary · Ufuk Topcu 🔗 |
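The construct-verify-refine loop can be illustrated on a toy automaton; the controller, observations, and specification below are invented, and verification is reduced to a simple reachability check against user-forbidden states.

```python
# Toy automaton-based controller: (state, observation) -> next state.
# Verification checks whether any specification-forbidden state is reachable;
# refinement removes offending transitions until the check passes.

def reachable(transitions, start):
    seen, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for (src, _obs), dst in transitions.items():
            if src == s and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen

def verify_and_refine(transitions, start, forbidden):
    refined = dict(transitions)
    while reachable(refined, start) & forbidden:
        refined = {k: v for k, v in refined.items() if v not in forbidden}
    return refined

controller = {
    ("start", "see_door"): "at_door",
    ("at_door", "door_open"): "inside",
    ("at_door", "door_locked"): "forcing_door",  # violates the spec below
}
safe = verify_and_refine(controller, "start", forbidden={"forcing_door"})
```

In the paper the transitions would be extracted from a pretrained model and grounded visually; here they are hand-written to keep the sketch self-contained.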
-
|
Multimodal Pretrained Models for Verifiable Sequential Decision-Making: Planning, Grounding, and Perception (Oral) > link
-
|
RF-POLICY: Rectified Flows are Computation-Adaptive Decision Makers (Poster) > link
Diffusion-based imitation learning improves Behavioral Cloning (BC) on multi-modal decision-making but comes at the cost of significantly slower inference due to the recursion in the diffusion process. However, in real-world scenarios, states that require multi-modal decision-making are rare, and the heavy computation of diffusion models is unnecessary for most cases. This inspires us to design efficient policy generators that can wisely allocate computation for different contexts. To address this challenge, we propose RF-POLICY (Rectified Flow-Policy), an imitation learning algorithm based on Rectified Flow, a recent advancement in flow-based generative modeling (Liu et al., 2022). RF-POLICY adopts probability flow ordinary differential equations (ODEs) for diverse policy generation, with the learning principle of following straight trajectories as much as possible. We uncover and leverage a surprisingly intriguing advantage of these flow-based models over previous diffusion models: their training objective indicates the uncertainty of a certain state, and when the state is uni-modal, they automatically reduce to one-step generators since the probability flows admit straight lines. Therefore, RF-POLICY is naturally an adaptive decision maker, offering rapid inference without sacrificing diversity. Our comprehensive empirical evaluation shows that RF-POLICY, to the best of our knowledge, is the first algorithm to achieve high performance across all dimensions, including success rate, behavioral diversity, and inference speed. |
Xixi Hu · Bo Liu · Xingchao Liu · Qiang Liu 🔗 |
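The straight-line property that enables one-step generation can be seen numerically: Euler integration of a probability-flow ODE is exact in a single step whenever the velocity is constant along the path. This toy (not the paper's code) integrates such a flow with one step and with many.

```python
# Euler integration of dx/dt = v(x, t) over t in [0, 1].

def integrate(v, x0, steps):
    x, t, dt = x0, 0.0, 1.0 / steps
    for _ in range(steps):
        x = x + dt * v(x, t)
        t += dt
    return x

# A rectified ("straightened") flow transporting x0 toward a target along a
# straight line has the constant velocity v = target - x0.
x0, target = 0.5, 2.0
v_straight = lambda x, t: target - x0

one_step = integrate(v_straight, x0, steps=1)    # a single Euler step
many_steps = integrate(v_straight, x0, steps=100)  # fine discretization
```

Both integrations land on the target exactly, which is why a policy whose flows are straight can collapse to a one-step generator on uni-modal states.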
-
|
RF-POLICY: Rectified Flows are Computation-Adaptive Decision Makers (Oral) > link
-
|
Creative Robot Tool Use with Large Language Models (Poster) > link
Tool use is a hallmark of advanced intelligence, exemplified in both animal behavior and robotic capabilities. This paper investigates the feasibility of imbuing robots with the ability to creatively use tools in tasks that involve implicit physical constraints and long-term planning. Leveraging Large Language Models (LLMs), we develop RoboTool, a system that accepts natural language instructions and outputs executable code for controlling robots in both simulated and real-world environments. RoboTool incorporates four pivotal components: (i) an “Analyzer” that interprets natural language to discern key task-related concepts, (ii) a “Planner” that generates comprehensive strategies based on the language input and key concepts, (iii) a “Calculator” that computes parameters for each skill, and (iv) a “Coder” that translates these plans into executable Python code. Our results show that RoboTool can not only comprehend implicit physical constraints and environmental factors but also demonstrate creative tool use. Unlike traditional Task and Motion Planning (TAMP) methods that rely on explicit optimization and are confined to formal logic, our LLM-based system offers a more flexible, efficient, and user-friendly solution for complex robotics tasks. Through extensive experiments, we validate that RoboTool is proficient in handling tasks that would otherwise be infeasible without the creative use of tools, thereby expanding the capabilities of robotic systems. |
Mengdi Xu · Wenhao Yu · Peide Huang · Shiqi Liu · Xilun Zhang · Yaru Niu · Tingnan Zhang · Fei Xia · Jie Tan · DING ZHAO 🔗 |
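The four-component pipeline can be sketched with toy stand-ins for each LLM-backed module; the tiny skill API, the keyword-based "Analyzer", and all names below are invented for illustration only.

```python
# Toy RoboTool-style pipeline: Analyzer -> Planner -> Calculator -> Coder,
# ending in executable Python (here the generated code just logs skill calls).

def analyzer(instruction):
    # Extract key task-related concepts (naive keyword spotting).
    return [w for w in ("box", "ramp") if w in instruction]

def planner(instruction, concepts):
    # Produce an ordered skill plan from the instruction and concepts.
    return [("move_to", c) for c in concepts] + [("grasp", concepts[0])]

def calculator(plan):
    # Attach numeric parameters to each skill (a fixed speed here).
    return [(skill, arg, {"speed": 0.2}) for skill, arg in plan]

def coder(plan):
    # Translate the parameterized plan into executable Python source.
    lines = [f"log.append(({skill!r}, {arg!r}, {params['speed']}))"
             for skill, arg, params in plan]
    return "\n".join(lines)

instruction = "push the box up the ramp"
code = coder(calculator(planner(instruction, analyzer(instruction))))
log = []
exec(code, {"log": log})  # run the generated program
```

In the real system each stage is an LLM call and the generated code drives robot skills rather than a log.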
-
|
Creative Robot Tool Use with Large Language Models (Oral) > link
-
|
Investigating the Effectiveness of Self-critiquing in LLMs solving Planning Tasks (Poster) > link
There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that LLMs, when used as verifiers, produce a notable number of false positives, compromising system reliability. Additionally, self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers. The nature of feedback, whether binary or detailed, showed minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs as verifiers in an iterative, self-critiquing framework for planning tasks. |
Karthik Valmeekam · Matthew Marquez · Subbarao Kambhampati 🔗 |
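The generate-and-verify loop under study can be caricatured as follows; the deterministic "planner" and both verifiers are invented, with the lenient one mimicking an LLM verifier's false positives.

```python
# Generate-then-verify loop: keep proposing plans until the verifier accepts.

GOAL = ("pick", "move", "place")

def planner(attempt):
    # Deterministic stand-in for an LLM planner: cycles through candidates.
    candidates = [("pick", "place"), ("move", "pick", "place"), GOAL]
    return candidates[attempt % len(candidates)]

def sound_verifier(plan):
    # External, sound verifier: accepts only the correct plan.
    return plan == GOAL

def lenient_verifier(plan):
    # Mimics a false-positive-prone verifier: accepts any plan that merely
    # ends with the right action.
    return plan[-1] == "place"

def solve(verifier, max_attempts=10):
    for attempt in range(max_attempts):
        plan = planner(attempt)
        if verifier(plan):
            return plan
    return None

plan_sound = solve(sound_verifier)      # loops until the correct plan
plan_lenient = solve(lenient_verifier)  # accepts a wrong plan immediately
```

The lenient verifier terminates the loop early on an invalid plan, which is the failure mode the paper measures for LLM self-critique.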
-
|
Investigating the Effectiveness of Self-critiquing in LLMs solving Planning Tasks (Oral) > link
-
|
Capture the Flag: Uncovering Data Insights with Large Language Models (Poster) > link
The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a “capture the flag” principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community. |
Issam Hadj Laradji · Perouz Taslakian · Sai Rajeswar Mudumba · Valentina Zantedeschi · Alexandre Lacoste · Nicolas Chapados · David Vazquez · Chris Pal · Alexandre Drouin 🔗 |
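A minimal version of "capture the flag" scoring might look like this; the case-insensitive substring match and the example flags are simplifications invented for illustration, not the paper's protocol.

```python
# Score an agent's free-text findings by the fraction of planted
# ground-truth flags they mention.

def flag_recall(findings, flags):
    found = {f for f in flags
             if any(f.lower() in note.lower() for note in findings)}
    return len(found) / len(flags)

flags = ["sales dropped in Q3", "region West underperforms"]
findings = [
    "Notable anomaly: Sales dropped in Q3 relative to trend.",
    "Marketing spend rose steadily.",
]
score = flag_recall(findings, flags)  # one of two flags captured
```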
-
|
Capture the Flag: Uncovering Data Insights with Large Language Models (Oral) > link
-
|
$S^2AC$: Energy-Based Reinforcement Learning with Stein Soft Actor Critic (Poster) > link
Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, resulting in high computational complexity and variance (SQL), or follow a variational inference procedure that fits simplified actor distributions (e.g., Gaussian) for tractability (SAC). We propose Stein Soft Actor-Critic ($S^2AC$), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. Specifically, $S^2AC$ uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. We derive a closed-form expression of the entropy of such policies. Our formula is computationally efficient and only depends on first-order derivatives and vector products. Empirical results show that $S^2AC$ yields more optimal solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment, and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://anonymous.4open.science/r/Stein-Soft-Actor-Critic/
|
Safa Messaoud · Billel Mokeddem · Zhenghai Xue · Bo An · Haipeng Chen · Sanjay Chawla 🔗 |
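The SVGD update underlying $S^2AC$'s policy can be sketched in a few lines of NumPy; the 1-D target distribution, step size, kernel bandwidth, and particle count below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

# One SVGD step: phi(x_i) = E_j[ k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i) ]
# The first term pulls particles toward high density; the second repels them
# from each other, which is what keeps the policy multi-modal.

def rbf(x, h=0.5):
    diff = x[:, None] - x[None, :]        # diff[j, i] = x_j - x_i
    k = np.exp(-diff**2 / (2 * h**2))     # kernel matrix k(x_j, x_i)
    grad_k = -diff / h**2 * k             # d k(x_j, x_i) / d x_j
    return k, grad_k

def svgd_step(x, score, eps=0.1):
    k, grad_k = rbf(x)
    phi = (k * score(x)[:, None]).mean(axis=0) + grad_k.mean(axis=0)
    return x + eps * phi

score = lambda x: -x                      # grad log-density of N(0, 1)
rng = np.random.default_rng(0)
x = rng.normal(5.0, 0.1, size=50)         # particles start far from the target
for _ in range(500):
    x = svgd_step(x, score)               # drift toward N(0, 1)
```

After the updates, the particle cloud sits near the target mean while the kernel-repulsion term keeps it spread out.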
-
|
$S^2AC$: Energy-Based Reinforcement Learning with Stein Soft Actor Critic (Oral) > link
-
|
Ring Attention with Blockwise Transformers for Near-Infinite Context (Poster) > link
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks with the computation of blockwise attention. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that are device count times longer than those of prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size and improving performance. |
Hao Liu · Matei A Zaharia · Pieter Abbeel 🔗 |
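The blockwise computation at the heart of Ring Attention can be sketched for a single head on one device: key/value blocks are consumed one at a time with an online softmax, so the full score matrix is never materialized. Shapes and block size below are arbitrary.

```python
import numpy as np

def blockwise_attention(q, k, v, block=4):
    n = k.shape[0]
    m = np.full(q.shape[0], -np.inf)          # running row-max of scores
    l = np.zeros(q.shape[0])                  # running softmax normalizer
    out = np.zeros((q.shape[0], v.shape[1]))  # unnormalized accumulator
    for start in range(0, n, block):          # one iteration per KV block
        s = q @ k[start:start + block].T      # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)             # rescale earlier contributions
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ v[start:start + block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 16)) for _ in range(3))
ring = blockwise_attention(q, k, v)

# Reference: ordinary full-matrix softmax attention.
scores = q @ k.T
w = np.exp(scores - scores.max(axis=1, keepdims=True))
full = (w / w.sum(axis=1, keepdims=True)) @ v
```

Distributing the loop iterations across devices, with each device passing its KV block to its neighbor while computing, is the "ring" part the paper adds on top of this blockwise recurrence.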
-
|
Ring Attention with Blockwise Transformers for Near-Infinite Context (Oral) > link
-
|
GPT4GEO: How a Language Model Sees the World’s Geography (Poster) > link
Large language models (LLMs) have shown remarkable capabilities across a broad range of tasks involving question answering and the generation of coherent text and code. Comprehensively understanding the strengths and weaknesses of LLMs is beneficial for safety, downstream applications and improving performance. In this work, we investigate the degree to which GPT-4 has acquired factual geographic knowledge and is capable of using this knowledge for interpretative reasoning, which is especially important for applications that involve geographic data, such as geospatial analysis, supply chain management, and disaster response. To this end, we design and conduct a series of diverse experiments, starting from factual tasks such as location, distance and elevation estimation to more complex questions such as generating country outlines and travel networks, route finding under constraints and supply chain analysis. We provide a broad characterisation of what GPT-4 knows about the world, highlighting promising and potentially surprising capabilities but also limitations. |
Jonathan Roberts · Timo Lüddecke · Sowmen Das · Kai Han · Samuel Albanie 🔗 |
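One concrete way to score the distance-estimation probes is to compare a model's answer against the great-circle (haversine) distance; the city pair and the hypothetical model estimate of 350 km are illustrative, not values from the paper.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    # Great-circle distance between two (lat, lon) points on a sphere.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# London -> Paris, scored against a hypothetical model estimate of 350 km.
truth = haversine_km(51.5074, -0.1278, 48.8566, 2.3522)
relative_error = abs(350.0 - truth) / truth
```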
-
|
GPT4GEO: How a Language Model Sees the World’s Geography (Oral) > link
-
|
Vision-and-Language Navigation in Real World using Foundation Models (Poster) > link
When mobile robots become ubiquitous, they occasionally encounter unseen environments. Enhancing mobile robots with the ability to follow language instructions will improve decision-making efficiency in previously unseen scenarios. However, state-of-the-art (SOTA) vision-and-language navigation (VLN) methods are mainly evaluated in simulation, neglecting the complex real world. Directly transferring SOTA navigation policies learned in simulation to the real world is challenging due to the visual domain gap and the absence of prior knowledge about unseen environments. In this work, we propose a novel navigation framework to address the VLN task in the real world, utilizing powerful foundation models. Specifically, the proposed framework includes four key components: (1) a large language model (LLM)-based instruction parser that converts a language instruction into a sequence of pre-defined macro-action descriptions, (2) an online visual-language mapper that builds a spatial and semantic map of the unseen environment using large visual-language models (VLMs), (3) a language-indexing-based localizer that grounds each macro-action description to a waypoint location on the map, and (4) a pre-trained DD-PPO-based local controller that predicts the action. Evaluated on an Interbotix LoCoBot WX250 in an unseen lab environment, without any fine-tuning, our framework significantly outperforms the SOTA VLN baseline in the real world. |
Chengguang Xu · Hieu T. Nguyen · Christopher Amato · Lawson Wong 🔗 |
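Component (3), the language-indexing-based localizer, can be caricatured as nearest-label matching; a real system would use learned embeddings, and the map labels below are invented for illustration.

```python
# Ground a macro-action description to the map waypoint whose semantic
# label shares the most tokens with it.

def ground(description, waypoints):
    tokens = set(description.lower().split())
    def overlap(item):
        return len(tokens & set(item[1].lower().split()))
    return max(waypoints.items(), key=overlap)[0]

semantic_map = {
    (1.0, 2.0): "red chair near window",
    (3.5, 0.5): "kitchen table",
    (0.0, 4.0): "door to hallway",
}
waypoint = ground("walk to the kitchen table", semantic_map)
```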
-
|
Vision-and-Language Navigation in Real World using Foundation Models (Oral) > link
-
|
D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation (Poster) > link
Scene representation has been a crucial design choice in robotic manipulation systems. An ideal representation should be 3D, dynamic, and semantic to meet the demands of diverse manipulation tasks. However, previous works often lack all three properties simultaneously. In this work, we introduce D$^3$Fields — dynamic 3D descriptor fields. These fields capture the dynamics of the underlying 3D environment and encode both semantic features and instance masks. Specifically, we project arbitrary 3D points in the workspace onto multi-view 2D visual observations and interpolate features derived from foundation models. The resulting fused descriptor fields allow for flexible goal specifications using 2D images with varied contexts, styles, and instances. To evaluate the effectiveness of these descriptor fields, we apply our representation to a wide range of robotic manipulation tasks in a zero-shot manner. Through extensive evaluation in both real-world scenarios and simulations, we demonstrate that D$^3$Fields are both generalizable and effective for zero-shot robotic manipulation tasks. In quantitative comparisons with state-of-the-art dense descriptors, such as Dense Object Nets and DINO, D$^3$Fields exhibit significantly better generalization abilities and manipulation accuracy.
|
Yixuan Wang · Zhuoran Li · Mingtong Zhang · Katherine Driggs-Campbell · Jiajun Wu · Fei-Fei Li · Yunzhu Li 🔗 |
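The core operation behind the descriptor fields, projecting a 3-D workspace point into a view and interpolating features at the projected pixel, can be sketched for a single synthetic camera; the intrinsics and the feature map below are invented for this sketch.

```python
import numpy as np

def project(point, fx=50.0, fy=50.0, cx=8.0, cy=8.0):
    # Pinhole projection of a 3-D point to pixel coordinates (u, v).
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy

def bilinear(feat, u, v):
    # Bilinear interpolation of a feature map feat[v, u] at sub-pixel (u, v).
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * feat[v0, u0]
            + du * (1 - dv) * feat[v0, u0 + 1]
            + (1 - du) * dv * feat[v0 + 1, u0]
            + du * dv * feat[v0 + 1, u0 + 1])

# Synthetic feature map where the feature at pixel (u, v) is (u, v) itself,
# so interpolation at any sub-pixel location should reproduce (u, v).
uu, vv = np.meshgrid(np.arange(16.0), np.arange(16.0))
feat = np.stack([uu, vv], axis=-1)            # shape (H, W, 2), indexed [v, u]

u, v = project((0.021, -0.03, 0.5))
descriptor = bilinear(feat, u, v)
```

Fusing such interpolated features across several calibrated views, using real foundation-model features instead of this synthetic map, yields the per-point descriptors the abstract describes.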
-
|
D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation (Oral) > link
-
|
Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models (Poster) > link
If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller attains. Specifically, we fine-tune InstructPix2Pix on robot data such that it outputs a hypothetical future observation given the robot's current observation and a language command. We then use the same robot data to train a low-level goal-conditioned policy to reach a given image observation. We find that when these components are combined, the resulting system exhibits robust generalization capabilities. The high-level planner utilizes its Internet-scale pre-training and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization than conventional language-conditioned policies. We demonstrate that this approach solves real robot control tasks involving novel objects, distractors, and even environments, both in the real world and in simulation. The project website can be found at https://subgoal-image-editing.github.io |
Kevin Black · Mitsuhiko Nakamoto · Pranav Atreya · Homer Walke · Chelsea Finn · Aviral Kumar · Sergey Levine 🔗 |
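The high-level/low-level split can be caricatured on a 1-D toy problem, with a stub subgoal proposer standing in for the image-editing diffusion model and a greedy stepper standing in for the goal-conditioned policy; everything below is invented for illustration.

```python
# SuSIE-style decomposition on a 1-D state: the high level proposes nearby
# subgoals, and the low level closes the gap to each subgoal in small steps.

def propose_subgoal(state, goal, horizon=3):
    # High level: an intermediate target at most `horizon` ahead of the state.
    return min(state + horizon, goal)

def low_level_step(state, subgoal):
    # Low level: greedy unit step toward the current subgoal.
    return state + (1 if subgoal > state else 0)

def run(state, goal):
    trajectory = [state]
    while state < goal:
        subgoal = propose_subgoal(state, goal)
        while state < subgoal:
            state = low_level_step(state, subgoal)
            trajectory.append(state)
    return trajectory

trajectory = run(0, 10)  # reaches the goal via subgoals 3, 6, 9, 10
```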
-
|
Zero-Shot Robotic Manipulation with Pre-Trained Image-Editing Diffusion Models (Oral) > link
-
|
On the Tool Manipulation Capability of Open-sourced Large Language Models
(
Poster
)
>
link
Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks of exposing information to closed LLM API services. In this paper, we ask whether we can enhance open-source LLMs to be competitive with leading closed LLM APIs in tool manipulation, given a practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration, and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in the LLM literature, and demonstrate that we can adapt them, as model alignment with programmatic data generation, system prompts, and in-context demonstration retrievers, to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 94% in success rate, showing capabilities competitive with OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, yielding a recipe with a practical amount of human supervision. |
Qiantong Xu · Fenglu Hong · Bo Li · Changran Hu · Zhengyu Chen · Jian Zhang 🔗 |
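One of the three techniques the abstract names, the in-context demonstration retriever, can be sketched as nearest-neighbor lookup over a demo pool. The token-overlap (Jaccard) scoring and the example pool below are simplifying assumptions; a real system would likely use learned embeddings.

```python
import re

def tokens(text):
    # Lowercase alphanumeric tokens; splits identifiers like
    # get_forecast into ["get", "forecast"].
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, demo):
    q, d = tokens(query), tokens(demo)
    return len(q & d) / max(len(q | d), 1)  # Jaccard similarity

def retrieve_demos(query, demo_pool, k=2):
    # Return the k demonstrations most similar to the user query,
    # to be prepended to the prompt as in-context examples.
    return sorted(demo_pool, key=lambda demo: score(query, demo), reverse=True)[:k]

pool = [
    "weather.get_forecast(city='Paris')",
    "calendar.create_event(title='standup')",
    "weather.get_current(city='Tokyo')",
]
demos = retrieve_demos("get the weather forecast for London", pool)
```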
-
|
On the Tool Manipulation Capability of Open-sourced Large Language Models
(
Oral
)
>
link
Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks of exposing information to closed LLM API services. In this paper, we ask whether we can enhance open-source LLMs to be competitive with leading closed LLM APIs in tool manipulation, given a practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration, and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in the LLM literature, and demonstrate that we can adapt them, as model alignment with programmatic data generation, system prompts, and in-context demonstration retrievers, to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 94% in success rate, showing capabilities competitive with OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, yielding a recipe with a practical amount of human supervision. |
Qiantong Xu · Fenglu Hong · Bo Li · Changran Hu · Zhengyu Chen · Jian Zhang 🔗 |
-
|
CodePlan: Repository-level Coding using LLMs and Planning
(
Poster
)
>
link
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase involve pervasively editing the entire repository of code. While Large Language Models (LLMs) have shown impressive abilities in localized coding tasks, performing interdependent edits across a repository requires multi-step reasoning and planning abilities. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan. Our framework leverages static analysis techniques to discover dependencies throughout the repository, which are used both to provide sufficient context to the LLM and to determine the sequence of edits required to solve the repository-level task. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python) across multiple repositories. Our results demonstrate that CodePlan consistently beats baselines across tasks. Further qualitative analysis highlights how different components of the approach contribute to guiding the LLM towards the correct edits as well as maintaining the consistency of the repository. |
Ramakrishna Bairi · Atharv Sonwane · Aditya Kanade · Vageesh C · Arun Iyer · Suresh Parthasarathy · Sriram Rajamani · B. Ashok · Shashank Shet 🔗 |
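The core scheduling idea, ordering interdependent edits by the dependency graph that static analysis discovers, amounts to a topological sort. The file graph below is a hypothetical example, not the paper's implementation or dataset.

```python
from graphlib import TopologicalSorter

# file -> files it depends on (edges as discovered by static analysis);
# an edit to a dependency may force follow-up edits in its dependents,
# so dependencies must be edited first.
deps = {
    "utils.py": [],
    "models.py": ["utils.py"],
    "api.py": ["models.py", "utils.py"],
}

# static_order() yields each file only after all of its dependencies.
edit_order = list(TopologicalSorter(deps).static_order())
```

Each file in `edit_order` would then be handed to the LLM along with the already-applied upstream edits as context.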
-
|
CodePlan: Repository-level Coding using LLMs and Planning
(
Oral
)
>
link
Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase involve pervasively editing the entire repository of code. While Large Language Models (LLMs) have shown impressive abilities in localized coding tasks, performing interdependent edits across a repository requires multi-step reasoning and planning abilities. We frame repository-level coding as a planning problem and present a task-agnostic, neuro-symbolic framework called CodePlan. Our framework leverages static analysis techniques to discover dependencies throughout the repository, which are used both to provide sufficient context to the LLM and to determine the sequence of edits required to solve the repository-level task. We evaluate the effectiveness of CodePlan on two repository-level tasks: package migration (C#) and temporal code edits (Python) across multiple repositories. Our results demonstrate that CodePlan consistently beats baselines across tasks. Further qualitative analysis highlights how different components of the approach contribute to guiding the LLM towards the correct edits as well as maintaining the consistency of the repository. |
Ramakrishna Bairi · Atharv Sonwane · Aditya Kanade · Vageesh C · Arun Iyer · Suresh Parthasarathy · Sriram Rajamani · B. Ashok · Shashank Shet 🔗 |
-
|
Exploration with Principles for Diverse AI Supervision
(
Poster
)
>
link
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from the principles of unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing large language models to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision. |
Hao Liu · Matei A Zaharia · Pieter Abbeel 🔗 |
-
|
Exploration with Principles for Diverse AI Supervision
(
Oral
)
>
link
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it heavily leans on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from the principles of unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing large language models to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision. |
Hao Liu · Matei A Zaharia · Pieter Abbeel 🔗 |
-
|
Reason for Future, Act for Now: A Principled Architecture for Autonomous LLM Agents
(
Poster
)
>
link
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (RAFA). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks. By incorporating "classical" MDP techniques, RAFA introduces the first autonomous LLM agent with provable regret guarantees.
|
Zhihan Liu · Hao Hu · Shenao Zhang · Hongyi Guo · Shuqi Ke · Boyi Liu · Zhaoran Wang 🔗 |
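The "reason for future, act for now" loop has the shape of receding-horizon control: plan a full trajectory, execute only its first action, then replan from the new state. A minimal sketch under toy assumptions (a 1-D environment and a greedy stand-in for the LLM planner):

```python
def plan(state, goal, horizon):
    # Stand-in for the LLM planner: a greedy trajectory toward the
    # goal on a 1-D line ("reason for future").
    traj, s = [], state
    for _ in range(horizon):
        a = 1 if s < goal else (-1 if s > goal else 0)
        traj.append(a)
        s += a
    return traj

def rafa(state, goal, horizon=5, max_steps=20):
    memory = []  # feedback buffer the planner would condition on
    for _ in range(max_steps):
        if state == goal:
            break
        action = plan(state, goal, horizon)[0]  # "act for now": first action only
        state += action
        memory.append((state, action))         # store feedback, then replan
    return state, memory

final_state, memory = rafa(state=0, goal=3)
```

In the paper the replanning step also updates a posterior over the environment from the memory buffer; that Bayesian component is omitted here for brevity.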
-
|
Reason for Future, Act for Now: A Principled Architecture for Autonomous LLM Agents
(
Oral
)
>
link
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (RAFA). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks. By incorporating "classical" MDP techniques, RAFA introduces the first autonomous LLM agent with provable regret guarantees.
|
Zhihan Liu · Hao Hu · Shenao Zhang · Hongyi Guo · Shuqi Ke · Boyi Liu · Zhaoran Wang 🔗 |
-
|
AdaPlanner: Adaptive Planning from Feedback with Language Models
(
Poster
)
>
link
Large language models (LLMs) have recently demonstrated potential as autonomous agents for sequential decision-making tasks. However, most existing methods either take actions greedily without planning or rely on static plans that do not adapt to environmental feedback. Consequently, the sequential decision-making performance of LLM agents degrades as problem complexity and planning horizons increase. We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback. In AdaPlanner, the LLM agent adaptively refines its plan from feedback with both in-plan and out-of-plan refinement strategies. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities. Furthermore, we propose a skill discovery mechanism that leverages successful plans as few-shot exemplars, enabling the agent to plan and refine with fewer task demonstrations. Our experiments in the ALFWorld and MiniWoB++ environments demonstrate that AdaPlanner outperforms state-of-the-art baselines by 3.73% and 4.11% while utilizing 2x and 600x fewer samples, respectively. |
Haotian Sun · Yuchen Zhuang · Lingkai Kong · Bo Dai · Chao Zhang 🔗 |
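The closed loop described above, execute the plan until feedback signals a failure, then refine and retry, can be sketched with stubs. The environment, the failing action, and the repair rule are all illustrative assumptions standing in for the LLM-driven refinement.

```python
def execute(action, env):
    # Environmental feedback: True if the action is afforded here.
    return action in env["affordances"]

def refine(plan, failed_action):
    # Stand-in for out-of-plan refinement: swap the failing action
    # for a recovery action (an assumed repair rule, not the paper's).
    fixed = {"open_drawer": "pull_handle"}
    return [fixed.get(a, a) if a == failed_action else a for a in plan]

def adaplanner(plan, env, max_revisions=3):
    for _ in range(max_revisions):
        for action in plan:
            if not execute(action, env):
                plan = refine(plan, action)  # refine from feedback...
                break                        # ...and restart execution
        else:
            return plan, True  # every step succeeded
    return plan, False

env = {"affordances": {"pick_cup", "pull_handle", "place_cup"}}
plan, ok = adaplanner(["pick_cup", "open_drawer", "place_cup"], env)
```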
-
|
AdaPlanner: Adaptive Planning from Feedback with Language Models
(
Oral
)
>
link
Large language models (LLMs) have recently demonstrated potential as autonomous agents for sequential decision-making tasks. However, most existing methods either take actions greedily without planning or rely on static plans that do not adapt to environmental feedback. Consequently, the sequential decision-making performance of LLM agents degrades as problem complexity and planning horizons increase. We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback. In AdaPlanner, the LLM agent adaptively refines its plan from feedback with both in-plan and out-of-plan refinement strategies. To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities. Furthermore, we propose a skill discovery mechanism that leverages successful plans as few-shot exemplars, enabling the agent to plan and refine with fewer task demonstrations. Our experiments in the ALFWorld and MiniWoB++ environments demonstrate that AdaPlanner outperforms state-of-the-art baselines by 3.73% and 4.11% while utilizing 2x and 600x fewer samples, respectively. |
Haotian Sun · Yuchen Zhuang · Lingkai Kong · Bo Dai · Chao Zhang 🔗 |
-
|
Language Model Agents Suffer from Compositional Decision Making
(
Poster
)
>
link
Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications, which often involve combinations of tasks, is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.0%). While these results highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance further degrades under instruction compositions that change the order of combination. In contrast to the recent remarkable success of LMAs, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment. |
Hiroki Furuta · Yutaka Matsuo · Aleksandra Faust · Izzeddin Gur 🔗 |
-
|
Language Model Agents Suffer from Compositional Decision Making
(
Oral
)
>
link
Language model agents (LMAs) have recently emerged as a promising paradigm for multi-step decision-making tasks, often outperforming humans and other reinforcement learning agents. Despite this promise, their performance on real-world applications, which often involve combinations of tasks, is still underexplored. In this work, we introduce a new benchmark, called CompWoB -- 50 new compositional web automation tasks reflecting more realistic assumptions. We show that while existing prompted LMAs (gpt-3.5-turbo or gpt-4) achieve a 94.0% average success rate on base tasks, their performance degrades to a 24.9% success rate on compositional tasks. On the other hand, transferred LMAs (finetuned only on base tasks) show a smaller generalization gap, dropping from 85.4% to 54.8%. By balancing data distribution across tasks, we train a new model, HTML-T5++, that surpasses human-level performance (95.2%) on MiniWoB and achieves the best zero-shot performance on CompWoB (61.0%). While these results highlight the promise of small-scale finetuned and transferred models for compositional generalization, their performance further degrades under instruction compositions that change the order of combination. In contrast to the recent remarkable success of LMAs, our benchmark and detailed analysis emphasize the necessity of building LMAs that are robust and generalizable to task compositionality for real-world deployment. |
Hiroki Furuta · Yutaka Matsuo · Aleksandra Faust · Izzeddin Gur 🔗 |
-
|
Semantically-Driven Object Search Using Partially Observed 3D Scene Graphs
(
Poster
)
>
link
Object search is a fundamental task for service robots aiding humans in their daily lives. For example, a robot must locate a cup before pouring coffee, or locate a sponge before cleaning up a spill. As such, robots performing object search across many different and potentially unseen environments must reason about uncertainty in both environment layout and object location. In this work, we frame object search as a Partially Observable Markov Decision Process (POMDP), and propose a generalizable planner that combines the structured representations afforded by 3D scene graphs with the semantic knowledge of language models. Specifically, we introduce (i) 3DSG-POMDPs, which are POMDPs defined over 3D scene graphs that reduce the dimensionality of object search, and (ii) PROPHE-C, a sampling-based planner for solving 3DSG-POMDPs. We demonstrate the efficacy of PROPHE-C in a partially observable household environment, revealing that additional online inference leads to more efficient and exploratory search plans, compared to solely relying on language models for decision-making. |
Isaac Remy · Abhishek Gupta · Karen Leung 🔗 |
-
|
Semantically-Driven Object Search Using Partially Observed 3D Scene Graphs
(
Oral
)
>
link
Object search is a fundamental task for service robots aiding humans in their daily lives. For example, a robot must locate a cup before pouring coffee, or locate a sponge before cleaning up a spill. As such, robots performing object search across many different and potentially unseen environments must reason about uncertainty in both environment layout and object location. In this work, we frame object search as a Partially Observable Markov Decision Process (POMDP), and propose a generalizable planner that combines the structured representations afforded by 3D scene graphs with the semantic knowledge of language models. Specifically, we introduce (i) 3DSG-POMDPs, which are POMDPs defined over 3D scene graphs that reduce the dimensionality of object search, and (ii) PROPHE-C, a sampling-based planner for solving 3DSG-POMDPs. We demonstrate the efficacy of PROPHE-C in a partially observable household environment, revealing that additional online inference leads to more efficient and exploratory search plans, compared to solely relying on language models for decision-making. |
Isaac Remy · Abhishek Gupta · Karen Leung 🔗 |
-
|
Prospector: Improving LLM Agents with Self-Asking and Trajectory Ranking
(
Poster
)
>
link
Large language models (LLMs) have shown the ability to solve complex decision-making tasks beyond natural language processing tasks. Current LLM agents such as ReAct can solve interactive decision-making tasks by imitating the few-shot demonstrations given in the prompt. LLM agents based on few-shot in-context learning (ICL) achieve surprisingly high performance without training. Despite their simplicity and generalizability, ICL-based approaches do not optimize trajectories based on the reward from an environment. In this paper, we introduce Prospector, a reflective LLM agent that features Self-Asking and Trajectory Ranking. To elicit more appropriate actions from the LLM agent that contribute to following a given instruction, we introduce additional Self-Asking steps in the few-shot demonstrations. Furthermore, to take advantage of the stochastic generation of LLMs, we provide Trajectory Ranking, in which the LLM agent generates diverse (creative) trajectories and the most rewarding trajectory is selected using reward prediction models. On representative decision-making benchmark environments such as ALFWorld and WebShop, we empirically demonstrate that Prospector can considerably increase the success rate of given tasks, while outperforming recent advancements such as ReAct and Reflexion. |
Byoungjip Kim · Youngsoo Jang · Lajanugen Logeswaran · Geon-Hyeong Kim · Yu Jin Kim · Honglak Lee · Moontae Lee 🔗 |
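Trajectory Ranking reduces to best-of-n sampling under a reward model. A toy sketch, where the trajectory sampler and the scorer are hypothetical stand-ins for the stochastic LLM agent and the learned reward predictor:

```python
import random

def sample_trajectory(rng):
    # Stand-in for one stochastic LLM rollout: a short action sequence.
    return [rng.choice(["search", "click", "ask", "buy"]) for _ in range(4)]

def reward_model(traj):
    # Assumed scorer for illustration: reward asking and completing a purchase.
    return traj.count("ask") + 2 * traj.count("buy")

def best_of_n(n=8, seed=0):
    # Generate n diverse trajectories and keep the highest-scoring one.
    rng = random.Random(seed)
    trajs = [sample_trajectory(rng) for _ in range(n)]
    return max(trajs, key=reward_model)

best = best_of_n()
```

Note that this exploits exactly the property the abstract names: because LLM generation is stochastic, repeated sampling yields diverse candidates for the ranker to choose among.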
-
|
Prospector: Improving LLM Agents with Self-Asking and Trajectory Ranking
(
Oral
)
>
link
Large language models (LLMs) have shown the ability to solve complex decision-making tasks beyond natural language processing tasks. Current LLM agents such as ReAct can solve interactive decision-making tasks by imitating the few-shot demonstrations given in the prompt. LLM agents based on few-shot in-context learning (ICL) achieve surprisingly high performance without training. Despite their simplicity and generalizability, ICL-based approaches do not optimize trajectories based on the reward from an environment. In this paper, we introduce Prospector, a reflective LLM agent that features Self-Asking and Trajectory Ranking. To elicit more appropriate actions from the LLM agent that contribute to following a given instruction, we introduce additional Self-Asking steps in the few-shot demonstrations. Furthermore, to take advantage of the stochastic generation of LLMs, we provide Trajectory Ranking, in which the LLM agent generates diverse (creative) trajectories and the most rewarding trajectory is selected using reward prediction models. On representative decision-making benchmark environments such as ALFWorld and WebShop, we empirically demonstrate that Prospector can considerably increase the success rate of given tasks, while outperforming recent advancements such as ReAct and Reflexion. |
Byoungjip Kim · Youngsoo Jang · Lajanugen Logeswaran · Geon-Hyeong Kim · Yu Jin Kim · Honglak Lee · Moontae Lee 🔗 |
-
|
$\texttt{PREMIER-TACO}$ is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss
(
Poster
)
>
link
We introduce $\texttt{Premier-TACO}$, a novel multitask feature representation learning methodology aiming to enhance the efficiency of few-shot policy learning in sequential decision-making tasks. $\texttt{Premier-TACO}$ pretrains a general feature representation using a small subset of relevant multitask offline datasets, capturing essential environmental dynamics. This representation can then be fine-tuned to specific tasks with few expert demonstrations. Building upon the recent temporal action contrastive learning (TACO) objective, which obtains state-of-the-art performance in visual control tasks, $\texttt{Premier-TACO}$ additionally employs a simple yet effective negative example sampling strategy. This key modification ensures computational efficiency and scalability for large-scale multitask offline pretraining. Experimental results from both the DeepMind Control Suite and MetaWorld domains underscore the effectiveness of $\texttt{Premier-TACO}$ for pretraining visual representations, facilitating efficient few-shot imitation learning of unseen tasks. On the DeepMind Control Suite, $\texttt{Premier-TACO}$ achieves an average improvement of 101% in comparison to a carefully implemented Learn-from-scratch baseline, and a 24% improvement compared with the most effective baseline pretraining method. Similarly, on MetaWorld, $\texttt{Premier-TACO}$ obtains an average advancement of 74% against Learn-from-scratch and a 40% increase in comparison to the best baseline pretraining method.
|
Ruijie Zheng · Yongyuan Liang · Xiyao Wang · Shuang Ma · Hal Daumé III · Huazhe Xu · John Langford · Praveen Palanisamy · Kalyan Basu · Furong Huang 🔗 |
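The temporal action-driven contrastive objective is InfoNCE-shaped: a state-action embedding should score high against the embedding of the state it actually leads to, and low against negatives sampled from elsewhere. A toy numeric sketch; the 2-D vectors and the negative-sampling choice are illustrative assumptions, not the paper's architecture.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchor, positive, negatives, temp=0.5):
    # -log( exp(sim(a, p)/t) / sum over {p} plus negatives )
    logits = [dot(anchor, positive) / temp]
    logits += [dot(anchor, n) / temp for n in negatives]
    return -logits[0] + math.log(sum(math.exp(l) for l in logits))

# anchor: current state-action embedding; positive: embedding of the
# next state; negatives: embeddings sampled from other trajectories
# (the sampling strategy is what Premier-TACO modifies for scalability).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-0.8, 0.2], [0.0, -1.0]]
loss = info_nce(anchor, positive, negatives)
```

A well-aligned positive drives the loss toward zero; swapping in a misaligned "positive" raises it, which is the gradient signal that shapes the representation.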
-
|
$\texttt{PREMIER-TACO}$ is a Few-Shot Policy Learner: Pretraining Multitask Representation via Temporal Action-Driven Contrastive Loss
(
Oral
)
>
link
We introduce $\texttt{Premier-TACO}$, a novel multitask feature representation learning methodology aiming to enhance the efficiency of few-shot policy learning in sequential decision-making tasks. $\texttt{Premier-TACO}$ pretrains a general feature representation using a small subset of relevant multitask offline datasets, capturing essential environmental dynamics. This representation can then be fine-tuned to specific tasks with few expert demonstrations. Building upon the recent temporal action contrastive learning (TACO) objective, which obtains state-of-the-art performance in visual control tasks, $\texttt{Premier-TACO}$ additionally employs a simple yet effective negative example sampling strategy. This key modification ensures computational efficiency and scalability for large-scale multitask offline pretraining. Experimental results from both the DeepMind Control Suite and MetaWorld domains underscore the effectiveness of $\texttt{Premier-TACO}$ for pretraining visual representations, facilitating efficient few-shot imitation learning of unseen tasks. On the DeepMind Control Suite, $\texttt{Premier-TACO}$ achieves an average improvement of 101% in comparison to a carefully implemented Learn-from-scratch baseline, and a 24% improvement compared with the most effective baseline pretraining method. Similarly, on MetaWorld, $\texttt{Premier-TACO}$ obtains an average advancement of 74% against Learn-from-scratch and a 40% increase in comparison to the best baseline pretraining method.
|
Ruijie Zheng · Yongyuan Liang · Xiyao Wang · Shuang Ma · Hal Daumé III · Huazhe Xu · John Langford · Praveen Palanisamy · Kalyan Basu · Furong Huang 🔗 |
-
|
Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks
(
Poster
)
>
link
Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL is capable of solving 20+ challenging single and multi-stage robotics tasks on four benchmarks at success rates of over 80% from raw visual input, outperforming language-based, classical, and end-to-end approaches. Video results and code at https://planseqlearn.github.io/ |
Murtaza Dalal · Tarun Chiruvolu · Devendra Singh Chaplot · Russ Salakhutdinov 🔗 |
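The modular decomposition can be sketched as a dispatcher: the LLM emits a step sequence, a classical motion planner handles the "get near the target" steps, and a learned RL policy handles the fine-grained interaction steps. All three components below are hypothetical stubs, including the step format.

```python
def llm_plan(task):
    # Stub for the LLM high-level planner (ignores `task` here).
    return ["reach:drawer", "interact:open", "reach:cup", "interact:grasp"]

def motion_plan(target):
    # Stub for the classical motion planner.
    return f"trajectory_to({target})"

def rl_policy(skill):
    # Stub for the learned low-level RL controller.
    return f"learned_control({skill})"

def psl(task):
    executed = []
    for step in llm_plan(task):
        kind, arg = step.split(":")
        if kind == "reach":
            executed.append(motion_plan(arg))  # classical motion planning
        else:
            executed.append(rl_policy(arg))    # learned fine-grained control
    return executed

trace = psl("open the drawer and grasp the cup")
```

The design point is that RL only has to learn the short contact-rich segments, while the motion planner absorbs the long free-space transits that would otherwise dominate exploration.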
-
|
Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks
(
Oral
)
>
link
Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL is capable of solving 20+ challenging single and multi-stage robotics tasks on four benchmarks at success rates of over 80% from raw visual input, outperforming language-based, classical, and end-to-end approaches. Video results and code at https://planseqlearn.github.io/ |
Murtaza Dalal · Tarun Chiruvolu · Devendra Singh Chaplot · Russ Salakhutdinov 🔗 |
-
|
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction
(
Poster
)
>
link
Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite these attempts, making unsupervised RL truly scalable remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call **Metric-Aware Abstraction (METRA)**. Our main idea is, instead of directly covering the state space, to only cover a compact latent space $\mathcal{Z}$ that is *metrically* connected to the state space $\mathcal{S}$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, making it scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, making it the *first* unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and video are available at https://sites.google.com/view/metra0
|
Seohong Park · Oleh Rybkin · Sergey Levine 🔗 |
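The METRA objective above can be made concrete with a minimal numpy sketch (not the authors' implementation): a skill is a unit vector $z$ in the latent space, and the intrinsic reward is the projection of the embedding step $\phi(s') - \phi(s)$ onto $z$. The function names and the stand-in embedding are illustrative; in the paper, $\phi$ is trained so latent distances lower-bound temporal distances.

```python
import numpy as np

def metra_reward(phi_s, phi_s_next, z):
    """METRA-style intrinsic reward: projection of the latent-space step onto
    the skill vector z. phi is the learned metric-aware embedding; the paper
    constrains ||phi(s) - phi(s')|| <= 1 for adjacent states so that latent
    distances lower-bound temporal distances."""
    return float((phi_s_next - phi_s) @ z)

def sample_skill(dim, rng):
    """Skills are unit vectors in the latent space; rewarding movement in
    every latent direction yields a diverse, state-covering set of behaviors."""
    z = rng.standard_normal(dim)
    return z / np.linalg.norm(z)
```

A behavior that moves the embedding in the direction of its skill earns positive reward; moving against it earns negative reward.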
-
|
METRA: Scalable Unsupervised RL with Metric-Aware Abstraction (Oral) link
Seohong Park · Oleh Rybkin · Sergey Levine 🔗 |
-
|
Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making (Poster) link
The recent success of Transformer in natural language processing has sparked its use in various domains. In offline reinforcement learning (RL), Decision Transformer (DT) is emerging as a promising model based on Transformer. However, we discovered that the attention module of DT is not well suited to capturing the inherent local dependence pattern in trajectories of RL modeled as a Markov decision process. To overcome the limitations of DT, we propose a novel action sequence predictor, named Decision ConvFormer (DC), based on the architecture of MetaFormer, a general architecture that processes multiple entities in parallel and models the interrelationships among them. DC employs local convolution filtering as the token mixer and can effectively capture the inherent local associations of the RL dataset. In extensive experiments, DC achieved state-of-the-art performance across various standard RL benchmarks while requiring fewer resources. Furthermore, we show that DC better understands the underlying meaning in data and exhibits enhanced generalization capability. |
Jeonghye Kim · Suyoung Lee · Woojun Kim · Youngchul Sung 🔗 |
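The core architectural swap in the abstract above, replacing attention with a local convolution token mixer, can be sketched in a few lines of numpy. This is an illustrative depthwise causal 1D convolution, not the authors' code; the function name and filter layout are assumptions.

```python
import numpy as np

def causal_local_conv_mixer(tokens, filters):
    """Depthwise causal 1D convolution used as a MetaFormer token mixer.

    tokens:  (T, D) sequence of token embeddings
    filters: (K, D) one length-K causal filter per channel
    Output token t mixes only tokens [t-K+1 .. t], matching the local
    (Markovian) dependence pattern in RL trajectories, instead of attending
    globally over the whole sequence.
    """
    T, D = tokens.shape
    K = filters.shape[0]
    padded = np.vstack([np.zeros((K - 1, D)), tokens])  # left-pad for causality
    out = np.zeros_like(tokens)
    for t in range(T):
        window = padded[t:t + K]  # tokens t-K+1 .. t (zeros before the start)
        out[t] = (window * filters).sum(axis=0)
    return out
```

A filter whose only nonzero tap is the most recent token acts as the identity, and because the convolution is causal, changing a later token never affects earlier outputs.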
-
|
Decision ConvFormer: Local Filtering in MetaFormer is Sufficient for Decision Making (Oral) link
Jeonghye Kim · Suyoung Lee · Woojun Kim · Youngchul Sung 🔗 |
-
|
Reward Model Ensembles Help Mitigate Overoptimization (Poster) link
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization. |
Thomas Coste · Usman Anwar · Robert Kirk · David Krueger 🔗 |
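The two conservative objectives named in the abstract are simple to state: WCO takes the worst reward across the ensemble, and UWO penalizes the ensemble mean by its disagreement. A minimal numpy sketch follows; the function names and the `beta` penalty coefficient are illustrative, not the paper's exact formulation.

```python
import numpy as np

def wco(rewards):
    """Worst-case optimization: score each sample by the minimum reward
    across the ensemble (rewards has shape (ensemble_size, n_samples))."""
    return np.min(rewards, axis=0)

def uwo(rewards, beta=0.5):
    """Uncertainty-weighted optimization: ensemble mean minus a variance
    penalty, so samples the reward models disagree on are scored down."""
    return np.mean(rewards, axis=0) - beta * np.var(rewards, axis=0)

def best_of_n(candidates, score_fn):
    """Best-of-n (BoN) sampling: return the candidate with the highest score
    under the chosen (possibly conservative) reward objective."""
    scores = [score_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```

With `beta=0`, UWO reduces to plain ensemble-mean optimization; raising `beta` makes the policy avoid regions where the proxy reward models disagree, which is where overoptimization tends to occur.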
-
|
Reward Model Ensembles Help Mitigate Overoptimization (Oral) link
Thomas Coste · Usman Anwar · Robert Kirk · David Krueger 🔗 |
-
|
Universal Visual Decomposer: Long-Horizon Manipulation Made Easy (Poster) link
Real-world robotic tasks stretch over extended horizons and encompass multiple stages. Learning long-horizon manipulation tasks, however, is a long-standing challenge, and demands decomposing the overarching task into several manageable subtasks to facilitate policy learning and generalization to unseen tasks. Prior task decomposition methods require task-specific knowledge, are computationally intensive, and cannot readily be applied to new tasks. To address these shortcomings, we propose Universal Visual Decomposer (UVD), an off-the-shelf task decomposition method for visual long-horizon manipulation using pre-trained visual representations designed for robotic control. At a high level, UVD discovers subgoals by detecting phase shifts in the embedding space of the pre-trained representation. Operating purely on visual demonstrations without auxiliary information, UVD can effectively extract visual subgoals embedded in the videos, while incurring zero additional training cost on top of standard visuomotor policy training. Goal-conditioned policies learned with UVD-discovered subgoals exhibit significantly improved compositional generalization at test time to unseen tasks. Furthermore, UVD-discovered subgoals can be used to construct goal-based reward shaping that jump-starts temporally extended exploration for reinforcement learning. We extensively evaluate UVD on both simulation and real-world tasks, and in all cases, UVD substantially outperforms baselines across imitation and reinforcement learning settings on in-domain and out-of-domain task sequences alike, validating the clear advantage of automated visual task decomposition within the simple, compact UVD framework. |
Zichen "Charles" Zhang · Yunshuang Li · Osbert Bastani · Abhishek Gupta · Dinesh Jayaraman · Jason Ma · Luca Weihs 🔗 |
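The phase-shift idea in the UVD abstract can be illustrated with a small numpy sketch, assuming per-frame embeddings from some pre-trained visual representation. This is a hypothetical decomposition rule, not the authors' algorithm: working backward from the final frame, it cuts a subgoal wherever the embedding distance to the current goal stops shrinking monotonically.

```python
import numpy as np

def discover_subgoals(embeddings, tol=1e-8):
    """UVD-style decomposition sketch over a (T, D) array of frame embeddings.

    Starting from the last frame as the goal, walk backward while frames get
    monotonically closer to the goal in embedding space; the frame where that
    monotone progress breaks marks a phase shift, which becomes the next
    subgoal to recurse on. Returns subgoal frame indices in temporal order."""
    subgoals = []
    goal = len(embeddings) - 1
    while goal > 0:
        d = np.linalg.norm(embeddings[:goal + 1] - embeddings[goal], axis=1)
        t = goal
        # within one phase, distance to the goal shrinks monotonically forward
        while t > 0 and d[t - 1] >= d[t] - tol:
            t -= 1
        subgoals.append(goal)
        goal = t
    return sorted(subgoals)
```

On a trajectory that approaches one configuration and then moves toward another, the turning point is recovered as a subgoal; a trajectory with monotone progress yields only the final goal.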
-
|
Universal Visual Decomposer: Long-Horizon Manipulation Made Easy (Oral) link
Zichen "Charles" Zhang · Yunshuang Li · Osbert Bastani · Abhishek Gupta · Dinesh Jayaraman · Jason Ma · Luca Weihs 🔗 |
-
|
Fast Imitation via Behavior Foundation Models (Poster) link
Imitation learning (IL) aims at producing agents that can imitate any behavior given a few expert demonstrations. Yet existing approaches require many demonstrations and/or running (online or offline) reinforcement learning (RL) algorithms for each new imitation task. Here we show that recent RL foundation models based on successor measures can imitate any expert behavior almost instantly with just a few demonstrations and no need for RL or fine-tuning, while accommodating several IL principles (behavioral cloning, feature matching, reward-based, and goal-based reductions). In our experiments, imitation via RL foundation models matches, and often surpasses, the performance of SOTA offline IL algorithms, and produces imitation policies from new demonstrations within seconds instead of hours. |
Matteo Pirotta · Andrea Tirinzoni · Ahmed Touati · Alessandro Lazaric · Yann Ollivier 🔗 |
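The "almost instant" imitation claimed above can be sketched with a forward-backward-style model, reduced to tabular numpy for illustration. Everything here is a stand-in: `B` plays the role of pretrained backward (successor-measure) embeddings, `F` the forward map, and the averaging rule is one of the simple IL reductions the abstract alludes to, not the authors' exact procedure.

```python
import numpy as np

def infer_task_vector(B, demo_states):
    """Infer a task embedding z from demonstrations alone by averaging the
    backward embeddings B (shape (n_states, d)) of the demonstrated states.
    No RL and no fine-tuning are involved, which is what makes the
    imitation step near-instant."""
    return B[demo_states].mean(axis=0)

def act(F, state, z):
    """Greedy policy from a pretrained forward map F (shape
    (n_states, n_actions, d)): pick the action whose embedding aligns
    best with the inferred task vector z."""
    return int(np.argmax(F[state] @ z))
```

Given new demonstrations, only the cheap averaging step is repeated; the pretrained `F` and `B` are reused across all imitation tasks.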
-
|
Fast Imitation via Behavior Foundation Models (Oral) link
Matteo Pirotta · Andrea Tirinzoni · Ahmed Touati · Alessandro Lazaric · Yann Ollivier 🔗 |
-
|
O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models (Poster) link
Recent advancements in large language models (LLMs) have exhibited promising performance in solving sequential decision-making problems. By imitating few-shot examples provided in the prompts (i.e., in-context learning), an LLM agent can interact with an external environment and complete given tasks without additional training. However, such few-shot examples are often insufficient to generate high-quality solutions for complex and long-horizon tasks, while the limited context length cannot accommodate larger-scale demonstrations. To this end, we propose an offline learning framework that utilizes offline data at scale (e.g., logs of human interactions) to facilitate the in-context learning performance of LLM agents. We formally define LLM-powered policies with both text-based approaches and code-based approaches. We then introduce an Offline Data-driven Discovery and Distillation (O3D) framework to improve LLM-powered policies without finetuning. O3D automatically discovers reusable skills and distills generalizable knowledge across multiple tasks based on offline interaction data, advancing the capability of solving downstream tasks. Empirical results under two interactive decision-making benchmarks (ALFWorld and WebShop) demonstrate that O3D can notably enhance the decision-making capabilities of LLMs through the offline discovery and distillation process, and consistently outperform baselines across various LLMs with both text-based and code-based policies. |
Yuchen Xiao · Yanchao Sun · Mengda Xu · Udari Madhushani · Jared Vann · Deepeka Garg · Sumitra Ganesh 🔗 |
-
|
O3D: Offline Data-driven Discovery and Distillation for Sequential Decision-Making with Large Language Models (Oral) link
Yuchen Xiao · Yanchao Sun · Mengda Xu · Udari Madhushani · Jared Vann · Deepeka Garg · Sumitra Ganesh 🔗 |
-
|
Robotic Offline RL from Internet Videos via Value-Function Pre-Training (Poster) link
Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to leverage prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at https://sites.google.com/view/v-ptr. |
Chethan Bhateja · Derek Guo · Dibya Ghosh · Anikait Singh · Manan Tomar · Quan Vuong · Yevgen Chebotar · Sergey Levine · Aviral Kumar 🔗 |
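The key point of the abstract, that value functions can be learned from observation-only video via temporal-difference (TD) learning, is easy to see in a tabular sketch. This is a generic TD(0) pass over consecutive frames, not V-PTR itself; the synthetic reward labeling (e.g. +1 on task-complete frames) is an assumption for illustration.

```python
import numpy as np

def td_value_update(V, frames, rewards, alpha=0.1, gamma=0.99):
    """One TD(0) pass over an observation-only video.

    V:       value table indexed by (discretized) frame/state id
    frames:  sequence of state ids for consecutive video frames
    rewards: reward for each frame transition (len(frames) - 1 entries),
             e.g. a synthetic +1 when the task appears complete
    No actions are required, which is exactly why this works on video."""
    for t in range(len(frames) - 1):
        s, s_next = frames[t], frames[t + 1]
        target = rewards[t] + gamma * V[s_next]
        V[s] += alpha * (target - V[s])
    return V
```

Run over many passes, values propagate backward from the rewarding transition, so earlier frames learn how close they are to task completion, and that value signal is what transfers to downstream robot data.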
-
|
Robotic Offline RL from Internet Videos via Value-Function Pre-Training (Oral) link
Chethan Bhateja · Derek Guo · Dibya Ghosh · Anikait Singh · Manan Tomar · Quan Vuong · Yevgen Chebotar · Sergey Levine · Aviral Kumar 🔗 |
-
|
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics (Poster) link
We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple embodiments (robot, human, human with grasping tool). With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We explore the economics of collection costs and find that for a fixed budget it is beneficial to take advantage of the cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making it deployable with human oversight even if imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named \modelname{} trained on our dataset that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings, with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks.
Thanks to video conditioning and dataset diversity, the model can be used as a general video value function (e.g. success and affordance) in situations where actions need to be recognized rather than states, expanding capabilities and environment understanding for robots. Data and videos are available at anonymous-robovqa.github.io |
Pierre Sermanet · Tianli Ding · Jeffrey Zhao · Fei Xia · Debidatta Dwibedi · Keerthana Gopalakrishnan · Gabriel Dulac-Arnold · sharath maddineni · Nikhil Joshi · Pete Florence · Wei Han · Robert Baruch · Yao Lu · Suvir Mirchandani · Peng Xu · Pannag Sanketi · Karol Hausman · Izhak Shafran · brian ichter · Yuan Cao
|
-
|
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics (Oral) link
Pierre Sermanet · Tianli Ding · Jeffrey Zhao · Fei Xia · Debidatta Dwibedi · Keerthana Gopalakrishnan · Gabriel Dulac-Arnold · sharath maddineni · Nikhil Joshi · Pete Florence · Wei Han · Robert Baruch · Yao Lu · Suvir Mirchandani · Peng Xu · Pannag Sanketi · Karol Hausman · Izhak Shafran · brian ichter · Yuan Cao
|
-
|
Importance of Directional Feedback for LLM-based Optimizers (Poster) link
We study the potential of using large language models (LLMs) as an interactive optimizer for solving maximization problems on a text space using natural language and numerical feedback. Inspired by the classical optimization literature, we classify the natural language feedback into directional and non-directional, where the former is a generalization of the first-order feedback to the natural language space. We find that LLMs are especially capable of optimization when they are provided with directional feedback. Based on this insight, we design a new LLM-based optimizer that synthesizes directional feedback from the historical optimization trace to achieve reliable improvement over iterations. Empirically, we show our LLM-based optimizer is more stable and efficient in solving optimization problems, from maximizing mathematical functions to optimizing prompts for writing poems, compared with existing techniques. |
Allen Nie · Ching-An Cheng · Andrey Kolobov · Adith Swaminathan 🔗 |
-
|
Importance of Directional Feedback for LLM-based Optimizers (Oral) link
Allen Nie · Ching-An Cheng · Andrey Kolobov · Adith Swaminathan 🔗 |
-
|
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Poster) link
Constructing AI models that respond to text instructions is challenging, especially for sequential decision-making tasks. This work introduces an instruction-tuned Video Pretraining (VPT) model for Minecraft called STEVE-1, demonstrating that the unCLIP approach, utilized in DALL•E 2, is also effective for creating instruction-following sequential decision-making agents. STEVE-1 is trained in two steps: adapting the pretrained VPT model to follow commands in MineCLIP's latent space, then training a prior to predict latent codes from text. This allows us to finetune VPT through self-supervised behavioral cloning and hindsight relabeling, bypassing the need for costly human text annotations. By leveraging pretrained models like VPT and MineCLIP and employing best practices from text-conditioned image generation, STEVE-1 costs just $60 to train and can follow a wide range of short-horizon open-ended text and visual instructions in Minecraft. STEVE-1 sets a new bar for open-ended instruction following in Minecraft with low-level controls (mouse and keyboard) and raw pixel inputs, far outperforming previous baselines. We provide experimental evidence highlighting key factors for downstream performance, including pretraining, classifier-free guidance, and data scaling. All resources, including our model weights, training scripts, and evaluation tools are made available for further research. |
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith 🔗 |
-
|
STEVE-1: A Generative Model for Text-to-Behavior in Minecraft (Oral) link
Shalev Lifshitz · Keiran Paster · Harris Chan · Jimmy Ba · Sheila McIlraith 🔗 |
-
|
GPT-Driver: Learning to Drive with GPT (Poster) link
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose a novel approach to motion planning that capitalizes on the strong reasoning capabilities and generalization potential inherent to Large Language Models (LLMs). The fundamental insight of our approach is the reformulation of motion planning as a language modeling problem, a perspective not previously explored. Specifically, we represent the planner inputs and outputs as language tokens, and leverage the LLM to generate driving trajectories through a language description of coordinate positions. Furthermore, we propose a novel prompting-reasoning-finetuning strategy to stimulate the numerical reasoning potential of the LLM. With this strategy, the LLM can describe highly precise trajectory coordinates and also its internal decision-making process in natural language. We evaluate our approach on the large-scale nuScenes dataset, and extensive experiments substantiate the effectiveness, generalization ability, and interpretability of our GPT-based motion planner. |
Jiageng Mao · Yuxi Qian · Hang Zhao · Yue Wang 🔗 |
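The core move in GPT-Driver is representing planner inputs and outputs as language tokens. A hedged sketch of what such a coordinate-to-text round trip might look like (function names and the formatting convention are illustrative, not the authors'):

```python
def trajectory_to_text(waypoints):
    """Render a trajectory as a language string of (x, y) coordinates,
    so an LLM can emit it token by token."""
    return " ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints)

def text_to_trajectory(text):
    """Parse the LLM's textual trajectory back into coordinate pairs."""
    pairs = text.split()
    return [tuple(float(v) for v in p.strip("()").split(",")) for p in pairs]

traj = [(0.0, 0.0), (1.25, 0.5), (2.5, 1.0)]
encoded = trajectory_to_text(traj)
assert text_to_trajectory(encoded) == traj  # lossless round trip
print(encoded)  # (0.00,0.00) (1.25,0.50) (2.50,1.00)
```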
-
|
GPT-Driver: Learning to Drive with GPT
(
Oral
)
>
link
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose a novel approach to motion planning that capitalizes on the strong reasoning capabilities and generalization potential inherent to Large Language Models (LLMs). The fundamental insight of our approach is the reformulation of motion planning as a language modeling problem, a perspective not previously explored. Specifically, we represent the planner inputs and outputs as language tokens, and leverage the LLM to generate driving trajectories through a language description of coordinate positions. Furthermore, we propose a novel prompting-reasoning-finetuning strategy to stimulate the numerical reasoning potential of the LLM. With this strategy, the LLM can describe highly precise trajectory coordinates and also its internal decision-making process in natural language. We evaluate our approach on the large-scale nuScenes dataset, and extensive experiments substantiate the effectiveness, generalization ability, and interpretability of our GPT-based motion planner. |
Jiageng Mao · Yuxi Qian · Hang Zhao · Yue Wang 🔗 |
-
|
GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems
(
Poster
)
>
link
There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered by a slew of counterexamples--ranging from multiplication to simple planning--there is still the widespread belief that LLMs can self-critique and improve their own solutions in an iterative fashion. This belief seemingly rests on the assumption that verification of correctness should be easier than generation--a rather classical argument from computational complexity that should be irrelevant to LLMs to the extent that what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of Graph Coloring, a canonical NP-complete reasoning problem related to propositional satisfiability as well as practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT-4 in solving graph coloring instances or verifying the correctness of candidate colorings, in both direct and iterative modes. In iterative modes, we experiment both with the model critiquing its own answers and with an external, correct reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom-line performance. The study seems to indicate that (i) LLMs are bad at solving graph coloring instances, (ii) they are no better at verifying a solution--and thus are not effective in iterative modes with LLMs critiquing LLM-generated solutions, and (iii) the correctness and content of the criticisms--whether by LLMs or external solvers--seems largely irrelevant to the performance of iterative prompting.
We show that the observed effectiveness of LLMs in iterative settings is largely due to the correct solution being fortuitously present in the top-k completions of the prompt (and being recognized as such by an external verifier). Our results thus call into question claims about the self-critiquing capabilities of state-of-the-art LLMs. |
Kaya Stechly · Matthew Marquez · Subbarao Kambhampati 🔗 |
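The external verifier the study leans on is simple to state: a coloring is proper iff no edge joins two same-colored vertices. A minimal sketch of verifier-filtered selection over top-k candidate completions (names and the toy instance are ours):

```python
def is_proper_coloring(edges, coloring):
    """External verifier: a coloring is proper iff no edge joins
    two vertices of the same color."""
    return all(coloring[u] != coloring[v] for u, v in edges)

def best_of_topk(edges, candidates):
    """Return the first correct candidate among the top-k completions,
    mimicking verifier-filtered selection."""
    for c in candidates:
        if is_proper_coloring(edges, c):
            return c
    return None

edges = [(0, 1), (1, 2), (0, 2)]        # a triangle needs 3 colors
candidates = [
    {0: "r", 1: "r", 2: "g"},           # invalid: edge (0, 1) clashes
    {0: "r", 1: "g", 2: "b"},           # valid
]
print(best_of_topk(edges, candidates))  # picks the valid candidate
```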
-
|
GPT-4 Doesn’t Know It’s Wrong: An Analysis of Iterative Prompting for Reasoning Problems
(
Oral
)
>
link
There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples--ranging from multiplication to simple planning, there is still the wide spread belief that LLMs can self-critique and improve their own solutions in an iterative fashion. This belief seemingly rests on the assumption that verification of correctness should be easier than generation--a rather classical argument from computational complexity, that should be irrelevant to LLMs to the extent what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of {\em Graph Coloring}, a canonical NP-complete reasoning problem that is related to propositional satisfiability as well as practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT4 in solving graph coloring instances or verifying the correctness of candidate colorings--both in direct and iterative modes. In iterative modes, we experiment both with the model critiquing its own answers and an external correct reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom line performance. The study seems to indicate that (i) LLMs are bad at solving graph coloring instances (ii) they are no better at verifying a solution--and thus are not effective in iterative modes with LLMs critiquing LLM-generated solutions (iii) the correctness and content of the criticisms--whether by LLMs or external solvers--seems largely irrelevant to the performance of iterative prompting. 
We show that the observed effectiveness of LLMs in iterative settings is largely due to the correct solution being fortuitously present in the top-k completions of the prompt (and being recognized as such by an external verifier). Our results thus call into question claims about the self-critiquing capabilities of state of the art LLMs. |
Kaya Stechly · Matthew Marquez · Subbarao Kambhampati 🔗 |
-
|
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
(
Poster
)
>
link
Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent works like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by utilizing tree-search algorithms to guide multi-step reasoning. These methods mainly focus on LLMs' reasoning ability during inference and heavily rely on human-designed prompts to activate the LLM as a value function, thus lacking general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search learning framework for LLMs (termed TS-LLM), systematically showing how tree-search with a learned value function can guide LLMs' decoding ability. TS-LLM distinguishes itself in two key ways: (1) Leveraging a learned value function, our approach can be generally applied to different tasks beyond reasoning (such as RLHF alignment), and to LLMs of any size, without prompting advanced, large-scale models. (2) It can guide the LLM's decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64. |
Xidong Feng · Ziyu Wan · Muning Wen · Ying Wen · Weinan Zhang · Jun Wang 🔗 |
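For intuition, a toy illustration of value-guided search over token sequences, greatly simplified from TS-LLM's MCTS-style search to a greedy depth-limited walk (all names and the toy value function are illustrative):

```python
def tree_search_decode(expand, value, root, depth):
    """Greedy depth-limited tree search: at each step, expand the
    current node's children and follow the one the learned value
    function scores highest (a stand-in for full MCTS)."""
    node = root
    for _ in range(depth):
        children = expand(node)
        if not children:
            break
        node = max(children, key=value)
    return node

# Toy setting: sequences over tokens {0, 1}; the "learned" value
# function prefers sequences with more 1s, so search returns all 1s.
expand = lambda seq: [seq + (0,), seq + (1,)]
value = lambda seq: sum(seq)
print(tree_search_decode(expand, value, (), 4))  # (1, 1, 1, 1)
```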
-
|
Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training
(
Oral
)
>
link
Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent works like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by utilizing tree-search algorithms to guide multi-step reasoning. These methods mainly focus on LLMs' reasoning ability during inference and heavily rely on human-designed prompts to activate the LLM as a value function, thus lacking general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search learning framework for LLMs (termed TS-LLM), systematically showing how tree-search with a learned value function can guide LLMs' decoding ability. TS-LLM distinguishes itself in two key ways: (1) Leveraging a learned value function, our approach can be generally applied to different tasks beyond reasoning (such as RLHF alignment), and to LLMs of any size, without prompting advanced, large-scale models. (2) It can guide the LLM's decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64. |
Xidong Feng · Ziyu Wan · Muning Wen · Ying Wen · Weinan Zhang · Jun Wang 🔗 |
-
|
Voyager: An Open-Ended Embodied Agent with Large Language Models
(
Poster
)
>
link
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via black-box queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. |
Guanzhi Wang · Yuqi Xie · Yunfan Jiang · Ajay Mandlekar · Chaowei Xiao · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
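A minimal sketch of an ever-growing skill library with retrieval, substituting trivial keyword overlap for Voyager's embedding-based retrieval (the class and method names are ours, not the paper's):

```python
class SkillLibrary:
    """Ever-growing library of executable skills, retrieved by a
    (here, trivially keyword-based) match against a query."""

    def __init__(self):
        self.skills = {}  # description -> callable

    def add(self, description, fn):
        self.skills[description] = fn

    def retrieve(self, query):
        # Stand-in for embedding retrieval: pick the description with
        # the largest word overlap with the query.
        def overlap(desc):
            return len(set(desc.split()) & set(query.split()))
        return max(self.skills.items(), key=lambda kv: overlap(kv[0]))[1]

lib = SkillLibrary()
lib.add("craft wooden pickaxe", lambda: "crafted pickaxe")
lib.add("mine iron ore", lambda: "mined iron")
skill = lib.retrieve("mine some iron")
print(skill())  # mined iron
```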
-
|
Voyager: An Open-Ended Embodied Agent with Large Language Models
(
Oral
)
>
link
We introduce Voyager, the first LLM-powered embodied lifelong learning agent in an open-ended world that continuously explores, acquires diverse skills, and makes novel discoveries without human intervention in Minecraft. Voyager consists of three key components: 1) an automatic curriculum that maximizes exploration, 2) an ever-growing skill library of executable code for storing and retrieving complex behaviors, and 3) a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. Voyager interacts with GPT-4 via black-box queries, which bypasses the need for model parameter fine-tuning. The skills developed by Voyager are temporally extended, interpretable, and compositional, which compounds the agent’s capability rapidly and alleviates catastrophic forgetting. Empirically, Voyager demonstrates strong in-context lifelong learning capabilities. It outperforms prior SOTA by obtaining 3.1x more unique items, unlocking tech tree milestones up to 15.3x faster, and traveling 2.3x longer distances. Voyager is able to utilize the learned skill library in a new Minecraft world to solve novel tasks from scratch, while other techniques struggle to generalize. |
Guanzhi Wang · Yuqi Xie · Yunfan Jiang · Ajay Mandlekar · Chaowei Xiao · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
-
|
Double Policy Estimation for Importance Sampling in Sequence Modeling-Based Reinforcement Learning
(
Poster
)
>
link
Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as decision transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or with scarce rewards, importance sampling is not considered to correct the policy bias when dealing with off-policy data, mainly due to the absence of behavior policy and the use of deterministic evaluation policies. To this end, we propose an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks. Our method brings performance improvements on selected methods and outperforms state-of-the-art baselines in several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning. |
Hanhan Zhou · Tian Lan · Vaneet Aggarwal 🔗 |
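The off-policy correction at the heart of this work is ordinary importance sampling. A self-contained sketch of the basic estimator on a toy 1-D problem (this is the textbook estimator, not the paper's double-estimation scheme; all names are ours):

```python
import random

def importance_sampling_estimate(samples, target_pdf, behavior_pdf, f):
    """Off-policy correction: reweight each sample by the ratio of the
    evaluation (target) density to the behavior density."""
    weights = [target_pdf(x) / behavior_pdf(x) for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / len(samples)

# Estimate E_p[x] for p uniform on [0, 2], using samples drawn from
# q uniform on [0, 4]: the ratio is 2 on [0, 2] and 0 elsewhere.
random.seed(0)
xs = [random.uniform(0, 4) for _ in range(100_000)]
p = lambda x: 0.5 if 0 <= x <= 2 else 0.0
q = lambda x: 0.25
est = importance_sampling_estimate(xs, p, q, lambda x: x)
print(round(est, 2))  # close to the true value E_p[x] = 1.0
```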
-
|
Double Policy Estimation for Importance Sampling in Sequence Modeling-Based Reinforcement Learning
(
Oral
)
>
link
Offline reinforcement learning aims to utilize datasets of previously gathered environment-action interaction records to learn a policy without access to the real environment. Recent work has shown that offline reinforcement learning can be formulated as a sequence modeling problem and solved via supervised learning with approaches such as decision transformer. While these sequence-based methods achieve competitive results over return-to-go methods, especially on tasks that require longer episodes or with scarce rewards, importance sampling is not considered to correct the policy bias when dealing with off-policy data, mainly due to the absence of behavior policy and the use of deterministic evaluation policies. To this end, we propose an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation (DPE) in a unified framework with statistically proven properties on variance reduction. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks. Our method brings performance improvements on selected methods and outperforms state-of-the-art baselines in several tasks, demonstrating the advantages of enabling double policy estimation for sequence-modeled reinforcement learning. |
Hanhan Zhou · Tian Lan · Vaneet Aggarwal 🔗 |
-
|
Strategic Reasoning with Language Models
(
Poster
)
>
link
Strategic reasoning enables agents to cooperate, communicate, and compete with other agents in diverse situations. Existing approaches to solving strategic games rely on extensive training, yielding strategies that do not generalize to new scenarios or games without retraining. Large Language Models (LLMs), with their ability to comprehend and generate complex, context-rich language, could prove powerful as tools for strategic gameplay. This paper introduces an approach that uses pretrained LLMs with few-shot chain-of-thought examples to enable strategic reasoning for AI agents. Our approach uses systematically generated demonstrations of reasoning about states, values, and beliefs to prompt the model. Using extensive variations of simple matrix games, we show that strategies derived from systematically generated prompts generalize almost perfectly to new game structures, alternate objectives, and hidden information. Additionally, we demonstrate our approach can lead to human-like negotiation strategies in realistic scenarios without any extra training or fine-tuning. Our results highlight the ability of LLMs, guided by systematic reasoning demonstrations, to adapt and excel in diverse strategic scenarios. |
Kanishk Gandhi · Dorsa Sadigh · Noah Goodman 🔗 |
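For intuition, the best response that a prompted model must reason its way toward can be computed directly in a matrix game. A minimal sketch (the toy payoffs and names are ours):

```python
def best_response(payoff_matrix, opponent_action):
    """Best response in a matrix game: the row action maximizing our
    payoff against the opponent's column action."""
    column = [row[opponent_action] for row in payoff_matrix]
    return max(range(len(column)), key=column.__getitem__)

# Row player's payoffs in a 2x2 coordination game: matching the
# opponent's action is always best.
payoffs = [[3, 0],
           [0, 2]]
print(best_response(payoffs, 0))  # 0: coordinate on the first action
print(best_response(payoffs, 1))  # 1: coordinate on the second action
```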
-
|
Strategic Reasoning with Language Models
(
Oral
)
>
link
Strategic reasoning enables agents to cooperate, communicate, and compete with other agents in diverse situations. Existing approaches to solving strategic games rely on extensive training, yielding strategies that do not generalize to new scenarios or games without retraining. Large Language Models (LLMs), with their ability to comprehend and generate complex, context-rich language, could prove powerful as tools for strategic gameplay. This paper introduces an approach that uses pretrained LLMs with few-shot chain-of-thought examples to enable strategic reasoning for AI agents. Our approach uses systematically generated demonstrations of reasoning about states, values, and beliefs to prompt the model. Using extensive variations of simple matrix games, we show that strategies derived from systematically generated prompts generalize almost perfectly to new game structures, alternate objectives, and hidden information. Additionally, we demonstrate our approach can lead to human-like negotiation strategies in realistic scenarios without any extra training or fine-tuning. Our results highlight the ability of LLMs, guided by systematic reasoning demonstrations, to adapt and excel in diverse strategic scenarios. |
Kanishk Gandhi · Dorsa Sadigh · Noah Goodman 🔗 |
-
|
LLMs-augmented Contextual Bandit
(
Poster
)
>
link
Contextual bandits have emerged as a cornerstone in reinforcement learning, enabling systems to make decisions with partial feedback. However, as contexts grow in complexity, traditional bandit algorithms can face challenges in adequately capturing and utilizing such contexts. In this paper, we propose a novel integration of large language models (LLMs) with the contextual bandit framework. By leveraging LLMs as an encoder, we enrich the representation of the context, providing the bandit with a denser and more informative view. Preliminary results on synthetic datasets demonstrate the potential of this approach, showing notable improvements in cumulative rewards and reductions in regret compared to traditional bandit algorithms. This integration not only showcases the capabilities of LLMs in reinforcement learning but also opens the door to a new era of contextually-aware decision systems. |
Ali Baheri · Cecilia Alm 🔗 |
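A hedged sketch of the proposed combination: a standard LinUCB contextual bandit whose context vectors stand in for LLM embeddings of the raw context text (the toy reward function and all names are illustrative, not from the paper):

```python
import numpy as np

class LinUCB:
    """LinUCB contextual bandit; contexts are assumed to be dense
    LLM embeddings of the raw context."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vector

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Mean estimate plus an optimism-in-the-face-of-uncertainty bonus.
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy run: arm i pays off when embedding feature i is positive.
rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, dim=2)
for _ in range(200):
    x = rng.standard_normal(2)
    arm = bandit.select(x)
    bandit.update(arm, x, float(x[arm] > 0))
print(bandit.select(np.array([1.0, -1.0])))  # should prefer arm 0
```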
-
|
LLMs-augmented Contextual Bandit
(
Oral
)
>
link
Contextual bandits have emerged as a cornerstone in reinforcement learning, enabling systems to make decisions with partial feedback. However, as contexts grow in complexity, traditional bandit algorithms can face challenges in adequately capturing and utilizing such contexts. In this paper, we propose a novel integration of large language models (LLMs) with the contextual bandit framework. By leveraging LLMs as an encoder, we enrich the representation of the context, providing the bandit with a denser and more informative view. Preliminary results on synthetic datasets demonstrate the potential of this approach, showing notable improvements in cumulative rewards and reductions in regret compared to traditional bandit algorithms. This integration not only showcases the capabilities of LLMs in reinforcement learning but also opens the door to a new era of contextually-aware decision systems. |
Ali Baheri · Cecilia Alm 🔗 |
-
|
N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics
(
Poster
)
>
link
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors. |
Sajad Mousavi · Ricardo Luna Gutierrez · Desik Rengarajan · Vineet Gundecha · Ashwin Ramesh Babu · Avisek Naug · Antonio Guillen-Perez · Soumyendu Sarkar 🔗 |
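A minimal sketch of the critic-ensemble refinement loop with stub critics (the function names and the toy "revision" step are ours, not the paper's prompts or critic models):

```python
def refine_with_critics(draft, critics, revise, rounds=3):
    """Self-refinement loop: gather feedback from an ensemble of
    critics and revise the output until no critic objects."""
    text = draft
    for _ in range(rounds):
        feedback = [c(text) for c in critics]
        feedback = [f for f in feedback if f]  # keep only objections
        if not feedback:
            break
        text = revise(text, feedback)
    return text

# Toy critics flag a banned word; the "revision" just replaces it.
critics = [lambda t: "contains 'bad'" if "bad" in t else None]
revise = lambda t, fb: t.replace("bad", "good")
print(refine_with_critics("a bad answer", critics, revise))  # a good answer
```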
-
|
N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics
(
Oral
)
>
link
We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and fact hallucination. This method involves refining model outputs through an ensemble of critics and the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process observed in humans who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe performance improvements in LLMs for reducing toxicity and correcting factual errors. |
Sajad Mousavi · Ricardo Luna Gutierrez · Desik Rengarajan · Vineet Gundecha · Ashwin Ramesh Babu · Avisek Naug · Antonio Guillen-Perez · Soumyendu Sarkar 🔗 |
-
|
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
(
Poster
)
>
link
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. |
Wenlong Huang · Chen Wang · Ruohan Zhang · Yunzhu Li · Jiajun Wu · Fei-Fei Li 🔗 |
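A toy sketch of composing an affordance map with constraint maps and reading off the best next waypoint, collapsed to a 2-D grid for brevity (the 3D maps, names, and values here are illustrative, not the paper's):

```python
import numpy as np

def compose_value_maps(affordance, constraints):
    """Compose an affordance (reward) map with constraint (cost) maps
    and pick the highest-value cell as the next waypoint."""
    value = affordance - sum(constraints)  # rewards minus penalties
    idx = np.unravel_index(np.argmax(value), value.shape)
    return tuple(int(i) for i in idx)

# 3x3 toy grid: one attractive target cell, one penalized region.
affordance = np.zeros((3, 3))
affordance[2, 2] = 1.0  # "near the target object"
obstacle = np.zeros((3, 3))
obstacle[1, 1] = 5.0    # "avoid the obstacle at (1, 1)"
print(compose_value_maps(affordance, [obstacle]))  # (2, 2)
```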
-
|
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
(
Oral
)
>
link
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation in the form of reasoning and planning. Despite the progress, most still rely on pre-defined motion primitives to carry out the physical interactions with the environment, which remains a major bottleneck. In this work, we aim to synthesize robot trajectories, i.e., a dense sequence of 6-DoF end-effector waypoints, for a large variety of manipulation tasks given an open-set of instructions and an open-set of objects. We achieve this by first observing that LLMs excel at inferring affordances and constraints given a free-form language instruction. More importantly, by leveraging their code-writing capabilities, they can interact with a vision-language model (VLM) to compose 3D value maps to ground the knowledge into the observation space of the agent. The composed value maps are then used in a model-based planning framework to zero-shot synthesize closed-loop robot trajectories with robustness to dynamic perturbations. We further demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions. We present a large-scale study of the proposed method in both simulated and real-robot environments, showcasing the ability to perform a large variety of everyday manipulation tasks specified in free-form natural language. |
Wenlong Huang · Chen Wang · Ruohan Zhang · Yunzhu Li · Jiajun Wu · Fei-Fei Li 🔗 |
-
|
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
(
Poster
)
>
link
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single-sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/anon-vlmrm. We can improve performance by providing a second “baseline” prompt and projecting out parts of the CLIP embedding space irrelevant to distinguishing between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications. |
Juan Rocamonde · Victoriano Montesinos · Elvis Nava · Ethan Perez · David Lindner 🔗 |
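A hedged sketch of a CLIP-style reward with the goal-baseline idea described above, using plain cosine similarity on toy embeddings (the direction-based "projection" here is a simplified stand-in for the paper's exact formulation; all names are ours):

```python
import numpy as np

def vlm_reward(state_emb, goal_emb, baseline_emb=None):
    """CLIP-style reward: cosine similarity between the state embedding
    and the goal-prompt embedding. With a baseline prompt, score along
    the goal-minus-baseline direction to ignore shared, task-irrelevant
    content (a simplified stand-in for the paper's projection)."""
    g = goal_emb / np.linalg.norm(goal_emb)
    if baseline_emb is not None:
        b = baseline_emb / np.linalg.norm(baseline_emb)
        d = g - b
        g = d / np.linalg.norm(d)
    s = state_emb / np.linalg.norm(state_emb)
    return float(s @ g)

goal = np.array([1.0, 0.0])       # embedding of "humanoid kneeling"
baseline = np.array([0.0, 1.0])   # embedding of "humanoid"
on_task = np.array([0.9, 0.1])
off_task = np.array([0.1, 0.9])
print(vlm_reward(on_task, goal, baseline) > vlm_reward(off_task, goal, baseline))  # True
```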
-
|
Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
(
Oral
)
>
link
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single-sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/anon-vlmrm. We can improve performance by providing a second “baseline” prompt and projecting out parts of the CLIP embedding space irrelevant to distinguishing between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications. |
Juan Rocamonde · Victoriano Montesinos · Elvis Nava · Ethan Perez · David Lindner 🔗 |
-
|
Goal Masked Diffusion Policies for Unified Navigation and Exploration
(
Poster
)
>
link
Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models. We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches. |
Ajay Sridhar · Dhruv Shah · Catherine Glossop · Sergey Levine 🔗 |
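The unification of navigation and exploration rests on masking the goal conditioning so one policy covers both modes. A minimal sketch of that masking step (the function name and toy embedding are ours):

```python
import numpy as np

def masked_goal_conditioning(goal_emb, goal_mask):
    """Goal masking: zero out the goal embedding when exploring, so a
    single policy handles both goal-reaching (mask = 0) and undirected
    exploration (mask = 1)."""
    return goal_emb * (1.0 - goal_mask)

goal = np.array([0.3, -0.7, 1.2])
print(masked_goal_conditioning(goal, 0))  # goal-directed: embedding kept
print(masked_goal_conditioning(goal, 1))  # exploration: all zeros
```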
-
|
Goal Masked Diffusion Policies for Unified Navigation and Exploration
(
Oral
)
>
link
Robotic learning for navigation in unfamiliar environments needs to provide policies for both task-oriented navigation (i.e., reaching a goal that the robot has located), and task-agnostic exploration (i.e., searching for a goal in a novel setting). Typically, these roles are handled by separate models, for example by using subgoal proposals, planning, or separate navigation strategies. In this paper, we describe how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration, with the latter providing the ability to search novel environments, and the former providing the ability to reach a user-specified goal once it has been located. We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments, as compared to approaches that use subgoal proposals from generative models, or prior methods based on latent variable models. We instantiate our method by using a large-scale Transformer-based policy trained on data from multiple ground robots, with a diffusion model decoder to flexibly handle both goal-conditioned and goal-agnostic navigation. Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods, and demonstrate significant improvements in performance and lower collision rates, despite utilizing smaller models than state-of-the-art approaches. |
Ajay Sridhar · Dhruv Shah · Catherine Glossop · Sergey Levine 🔗 |
-
|
Learning to Solve New Sequential Decision-Making Tasks with In-Context Learning
(
Poster
)
>
link
Training autonomous agents that can generalize to new tasks from a small number of demonstrations is a long-standing problem in machine learning. Recently, transformers have displayed impressive few-shot learning capabilities on a wide range of domains in language and vision. However, the sequential decision-making setting poses additional challenges and has a much lower tolerance for errors since the environment's stochasticity or the agent's wrong actions can lead to unseen (and sometimes unrecoverable) states. In this paper, we use an illustrative example to show that a naive approach to using transformers in sequential decision-making problems does not lead to few-shot learning. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to few-shot learning in new sequential decision-making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity and trajectory burstiness, all result in better generalization to out-of-distribution tasks given just a few demonstrations per task. Leveraging these insights, we demonstrate our model's generalization to unseen MiniHack and Procgen tasks via in-context learning from just a handful of expert demonstrations per task. |
Sharath Chandra Raparthy · Eric Hambro · Robert Kirk · Mikael Henaff · Roberta Raileanu 🔗 |
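The "trajectory burstiness" property mentioned in this abstract can be sketched as a data-sampling routine: training sequences are built from whole trajectories, with some probability of repeating the previous task so the same task appears in bursts. The function and parameter names below are illustrative, not the authors' code:

```python
import random

def build_training_sequence(task_trajs, seq_len, burstiness, rng):
    """Sample a sequence of whole trajectories for in-context training.

    task_trajs : dict mapping task id -> list of trajectories.
    burstiness : probability of repeating the previous task, so one task
                 tends to appear several times within a training sequence.
    """
    tasks = list(task_trajs)
    seq, prev = [], None
    for _ in range(seq_len):
        if prev is not None and rng.random() < burstiness:
            task = prev                      # stay in the current burst
        else:
            task = rng.choice(tasks)         # switch to a random task
        seq.append((task, rng.choice(task_trajs[task])))
        prev = task
    return seq

rng = random.Random(0)
trajs = {"taskA": ["a1", "a2"], "taskB": ["b1"], "taskC": ["c1", "c2"]}
seq = build_training_sequence(trajs, seq_len=8, burstiness=0.7, rng=rng)
```

With high burstiness, a model that attends across the sequence can exploit earlier trajectories of the same task as in-context demonstrations.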
-
|
Learning to Solve New Sequential Decision-Making Tasks with In-Context Learning (Oral) > link
Training autonomous agents that can generalize to new tasks from a small number of demonstrations is a long-standing problem in machine learning. Recently, transformers have displayed impressive few-shot learning capabilities on a wide range of domains in language and vision. However, the sequential decision-making setting poses additional challenges and has a much lower tolerance for errors since the environment's stochasticity or the agent's wrong actions can lead to unseen (and sometimes unrecoverable) states. In this paper, we use an illustrative example to show that a naive approach to using transformers in sequential decision-making problems does not lead to few-shot learning. We then demonstrate how training on sequences of trajectories with certain distributional properties leads to few-shot learning in new sequential decision-making tasks. We investigate different design choices and find that larger model and dataset sizes, as well as more task diversity, environment stochasticity and trajectory burstiness, all result in better generalization to out-of-distribution tasks given just a few demonstrations per task. Leveraging these insights, we demonstrate our model's generalization to unseen MiniHack and Procgen tasks via in-context learning from just a handful of expert demonstrations per task. |
Sharath Chandra Raparthy · Eric Hambro · Robert Kirk · Mikael Henaff · Roberta Raileanu 🔗 |
-
|
ExPT: Scaling Foundation Models for Experimental Design via Synthetic Pretraining (Poster) > link
Experimental design is a fundamental problem in many science and engineering fields. In this problem, sample efficiency is crucial due to the time, money, and safety costs of real-world design evaluations. Existing approaches either rely on active data collection or access to large, labeled datasets of past experiments, making them impractical in many real-world scenarios. In this work, we address the more challenging yet realistic setting of few-shot experimental design, where only a few labeled data points of input designs and their corresponding values are available. We approach this problem as a conditional generation task, where a model conditions on a few labeled examples and the desired output to generate an optimal input design. To this end, we present Pretrained Transformers for Experimental Design (ExPT), which uses a novel combination of synthetic pretraining with in-context learning to enable few-shot generalization. In ExPT, we only assume knowledge of a finite collection of unlabelled data points from the input domain and pretrain a transformer neural network to optimize diverse synthetic functions defined over this domain. Unsupervised pretraining allows ExPT to adapt to any design task at test time in an in-context fashion by conditioning on a few labeled data points from the target task and generating the candidate optima. We evaluate ExPT on few-shot experimental design in challenging domains and demonstrate its superior generality and performance compared to existing methods. |
Tung Nguyen · Sudhanshu Agrawal · Aditya Grover 🔗 |
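The synthetic-pretraining setup described here, labeling a fixed pool of unlabelled inputs with randomly sampled functions and splitting the result into context pairs and a query, can be sketched as a data generator. The function family and names below are assumptions for illustration, not the paper's actual synthetic function class:

```python
import numpy as np

def sample_synthetic_task(x_pool, rng, n_context=4):
    """Draw one synthetic pretraining example: label points from an
    unlabelled input pool with a random function, then split into a few
    labelled context pairs and one held-out query for the model."""
    # hypothetical synthetic function family: y = sin(a*x + b)
    a, b = rng.uniform(0.5, 2.0), rng.uniform(-1.0, 1.0)
    idx = rng.permutation(len(x_pool))[: n_context + 1]
    x = x_pool[idx]
    y = np.sin(a * x + b)
    context = list(zip(x[:-1], y[:-1]))  # few labelled examples
    query_x, query_y = x[-1], y[-1]      # the model conditions on context
    return context, query_x, query_y     # and must handle this query

rng = np.random.default_rng(0)
x_pool = np.linspace(-2, 2, 100)         # unlabelled input domain
context, qx, qy = sample_synthetic_task(x_pool, rng)
```

At test time the same conditioning interface is reused, with real labelled design points in place of synthetic ones.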
-
|
ExPT: Scaling Foundation Models for Experimental Design via Synthetic Pretraining (Oral) > link
Experimental design is a fundamental problem in many science and engineering fields. In this problem, sample efficiency is crucial due to the time, money, and safety costs of real-world design evaluations. Existing approaches either rely on active data collection or access to large, labeled datasets of past experiments, making them impractical in many real-world scenarios. In this work, we address the more challenging yet realistic setting of few-shot experimental design, where only a few labeled data points of input designs and their corresponding values are available. We approach this problem as a conditional generation task, where a model conditions on a few labeled examples and the desired output to generate an optimal input design. To this end, we present Pretrained Transformers for Experimental Design (ExPT), which uses a novel combination of synthetic pretraining with in-context learning to enable few-shot generalization. In ExPT, we only assume knowledge of a finite collection of unlabelled data points from the input domain and pretrain a transformer neural network to optimize diverse synthetic functions defined over this domain. Unsupervised pretraining allows ExPT to adapt to any design task at test time in an in-context fashion by conditioning on a few labeled data points from the target task and generating the candidate optima. We evaluate ExPT on few-shot experimental design in challenging domains and demonstrate its superior generality and performance compared to existing methods. |
Tung Nguyen · Sudhanshu Agrawal · Aditya Grover 🔗 |
-
|
Natural Language-based State Representation in Deep Reinforcement Learning (Poster) > link
This study investigates the potential of using natural language descriptions as an alternative to direct image-based observations for learning policies in reinforcement learning. Due to the inherent challenges in managing image-based observations, which include abundant information and irrelevant features, we propose a method that compresses images into a natural language form for state representation. This approach allows better interpretability and leverages the processing capabilities of large language models (LLMs). We conducted several experiments involving tasks that required image-based observation. The results demonstrated that policies trained using natural language descriptions of images yield better generalization than those trained directly from images, emphasizing the potential of this approach in practical settings. |
Masudur Rahman · Yexiang Xue 🔗 |
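The compression step described here, turning an image observation into a natural language state description, can be sketched with a toy captioner over a grid observation. The captioner below is a hand-written stand-in for whatever vision-to-text model the method would use; `describe_grid` is a hypothetical name:

```python
import numpy as np

def describe_grid(grid):
    """Toy captioner: compress a small symbolic grid observation into a
    natural language description instead of feeding raw pixels to the
    policy. Codes: 2 = agent, 3 = goal."""
    names = {2: "agent", 3: "goal"}
    lines = []
    for code, name in names.items():
        ys, xs = np.where(grid == code)
        if len(xs):
            lines.append(f"The {name} is at row {ys[0]}, column {xs[0]}.")
    return " ".join(lines)

grid = np.zeros((3, 3), dtype=int)
grid[0, 0] = 2  # agent in the top-left corner
grid[2, 2] = 3  # goal in the bottom-right corner
state_text = describe_grid(grid)
```

A policy (or an LLM) then consumes `state_text` in place of the image, which is what enables the interpretability claim above.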
-
|
Natural Language-based State Representation in Deep Reinforcement Learning (Oral) > link
This study investigates the potential of using natural language descriptions as an alternative to direct image-based observations for learning policies in reinforcement learning. Due to the inherent challenges in managing image-based observations, which include abundant information and irrelevant features, we propose a method that compresses images into a natural language form for state representation. This approach allows better interpretability and leverages the processing capabilities of large language models (LLMs). We conducted several experiments involving tasks that required image-based observation. The results demonstrated that policies trained using natural language descriptions of images yield better generalization than those trained directly from images, emphasizing the potential of this approach in practical settings. |
Masudur Rahman · Yexiang Xue 🔗 |
-
|
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent (Poster) > link
In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-based visual question answering benchmarks such as Infoseek and OK-VQA. |
Ziniu Hu 🔗 |
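The control structure described in this abstract can be sketched as a small loop: a transition graph (derived from user studies in the paper; hard-coded here) confines which tool may be called in each state, and a working memory retains tool outputs. The graph contents, tool names, and the planner stub below are all hypothetical stand-ins for the LLM-powered components:

```python
# hypothetical transition graph: state -> tools allowed next
TRANSITIONS = {
    "start": ["image_search", "object_detect"],
    "image_search": ["llm_answer", "web_search"],
    "object_detect": ["image_search"],
    "web_search": ["llm_answer"],
    "llm_answer": [],  # terminal state
}

def planner_stub(state, memory):
    # an LLM-powered planner would choose among the allowed actions;
    # this stub deterministically takes the first legal one
    allowed = TRANSITIONS[state]
    return allowed[0] if allowed else None

def run_agent(question):
    state, memory = "start", {"question": question}
    trace = []
    while True:
        action = planner_stub(state, memory)
        if action is None:
            break
        # an LLM-powered reasoner would extract key info from the output
        memory[action] = f"<output of {action}>"
        trace.append(action)
        state = action
    return trace, memory

trace, memory = run_agent("What event does this building commemorate?")
```

Confining the action set per state is what shrinks the combinatorial search space the abstract mentions.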
-
|
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent (Oral) > link
In this paper, we propose an autonomous information seeking visual question answering framework, AVIS. Our method leverages a Large Language Model (LLM) to dynamically strategize the utilization of external tools and to investigate their outputs, thereby acquiring the indispensable knowledge needed to provide answers to the posed questions. Responding to visual questions that necessitate external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task. This task presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system comprised of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior serves as a guide for our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users. This graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-based visual question answering benchmarks such as Infoseek and OK-VQA. |
Ziniu Hu 🔗 |
-
|
Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting (Poster) > link
Large language models (LLMs) demonstrate remarkable medical expertise, but data privacy concerns impede their direct use in healthcare environments. Although offering improved data privacy protection, domain-specific small language models (SLMs) often underperform LLMs, emphasizing the need for methods that reduce this performance gap while alleviating privacy concerns. In this paper, we present a simple yet effective method that harnesses LLMs' medical proficiency to boost SLM performance in medical tasks under privacy-restricted scenarios. Specifically, we mitigate patient privacy issues by extracting keywords from medical data and prompting the LLM to generate a medical knowledge-intensive context by simulating clinicians' thought processes. This context serves as additional input for SLMs, augmenting their decision-making capabilities. Our method significantly enhances performance in both few-shot and full training settings across three medical knowledge-intensive tasks, achieving up to a 22.57% increase in absolute accuracy compared to SLM fine-tuning without context, and sets new state-of-the-art results in two medical tasks within privacy-restricted scenarios. Further out-of-domain testing and experiments in two general domain datasets showcase its generalizability and broad applicability.
|
Xinlu Zhang · Shiyang Li · Xianjun Yang · Chenxin Tian · Yao Qin · Linda Petzold 🔗 |
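The privacy-preserving pipeline described here, keywords out to the LLM, knowledge-rich context back to the SLM, can be sketched as below. The keyword extractor and the LLM call are toy stand-ins (the paper's extractor and prompt are more sophisticated); `llm_context_stub` is a hypothetical name:

```python
import re

STOPWORDS = {"the", "a", "an", "was", "with", "and", "of", "is", "year", "old"}

def extract_keywords(record):
    """Share only de-identified medical keywords with the external LLM,
    never the raw patient record (toy extractor: alphabetic words only,
    so numbers such as ages never leave the local environment)."""
    words = re.findall(r"[a-zA-Z]+", record.lower())
    return sorted({w for w in words if w not in STOPWORDS and len(w) > 3})

def llm_context_stub(keywords):
    # stands in for prompting a large model to generate a
    # medical knowledge-intensive context from keywords alone
    return "Clinical context for: " + ", ".join(keywords)

record = "The patient, a 67 year old, was admitted with pneumonia and sepsis."
keywords = extract_keywords(record)
# the SLM sees the generated context plus keywords, never the raw record
slm_input = llm_context_stub(keywords) + " [SEP] keywords: " + " ".join(keywords)
```

The point of the design is that only `keywords` crosses the privacy boundary; the raw record stays with the SLM owner.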
-
|
Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting (Oral) > link
Large language models (LLMs) demonstrate remarkable medical expertise, but data privacy concerns impede their direct use in healthcare environments. Although offering improved data privacy protection, domain-specific small language models (SLMs) often underperform LLMs, emphasizing the need for methods that reduce this performance gap while alleviating privacy concerns. In this paper, we present a simple yet effective method that harnesses LLMs' medical proficiency to boost SLM performance in medical tasks under privacy-restricted scenarios. Specifically, we mitigate patient privacy issues by extracting keywords from medical data and prompting the LLM to generate a medical knowledge-intensive context by simulating clinicians' thought processes. This context serves as additional input for SLMs, augmenting their decision-making capabilities. Our method significantly enhances performance in both few-shot and full training settings across three medical knowledge-intensive tasks, achieving up to a 22.57% increase in absolute accuracy compared to SLM fine-tuning without context, and sets new state-of-the-art results in two medical tasks within privacy-restricted scenarios. Further out-of-domain testing and experiments in two general domain datasets showcase its generalizability and broad applicability.
|
Xinlu Zhang · Shiyang Li · Xianjun Yang · Chenxin Tian · Yao Qin · Linda Petzold 🔗 |
-
|
LLM Augmented Hierarchical Agents (Poster) > link
Solving long horizon temporally extended tasks using Reinforcement Learning (RL) is extremely challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally extended actions and learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have the same capabilities. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to display impressive in-context learning and reasoning capabilities. However, using LLMs to solve real-world tasks is challenging as these models are not grounded in the current task. We want to leverage the planning capabilities of LLMs while using RL to provide the essential environment interaction. In this paper, we present a hierarchical agent which uses LLMs to solve long horizon tasks. Instead of completely relying on LLMs, we use them to guide the high-level policy, making them significantly more sample efficient. We evaluate our method on simulation environments such as MiniGrid, SkillHack, Crafter and on a real robot arm in block manipulation tasks. We show that agents trained using our method outperform other baseline methods and once trained, they don't depend on LLMs during deployment. |
Bharat Prakash · Tim Oates · Tinoosh Mohsenin 🔗 |
-
|
LLM Augmented Hierarchical Agents (Oral) > link
Solving long horizon temporally extended tasks using Reinforcement Learning (RL) is extremely challenging, compounded by the common practice of learning without prior knowledge (or tabula rasa learning). Humans can generate and execute plans with temporally extended actions and learn to perform new tasks because we almost never solve problems from scratch. We want autonomous agents to have the same capabilities. Recently, LLMs have been shown to encode a tremendous amount of knowledge about the world and to display impressive in-context learning and reasoning capabilities. However, using LLMs to solve real-world tasks is challenging as these models are not grounded in the current task. We want to leverage the planning capabilities of LLMs while using RL to provide the essential environment interaction. In this paper, we present a hierarchical agent which uses LLMs to solve long horizon tasks. Instead of completely relying on LLMs, we use them to guide the high-level policy, making them significantly more sample efficient. We evaluate our method on simulation environments such as MiniGrid, SkillHack, Crafter and on a real robot arm in block manipulation tasks. We show that agents trained using our method outperform other baseline methods and once trained, they don't depend on LLMs during deployment. |
Bharat Prakash · Tim Oates · Tinoosh Mohsenin 🔗 |
-
|
TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents (Poster) > link
With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement. |
Jingqing Ruan · YiHong Chen · Bin Zhang · Zhiwei Xu · Tianpeng Bao · du qing · shi shiwei · Hangyu Mao · Xingyu Zeng · Rui Zhao 🔗 |
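The two agent types named in this abstract differ in when planning happens: a one-step agent plans the whole subtask list up front, while a sequential agent plans each next subtask conditioned on results so far. A minimal sketch, with stubs standing in for the LLM planner and the tools (all names below are hypothetical):

```python
def execute(subtask):
    # stand-in for invoking an external tool on one subtask
    return f"done:{subtask}"

def one_step_agent(task, plan_fn):
    """Plan the full subtask list once, then execute every subtask."""
    subtasks = plan_fn(task, history=None)
    return [execute(s) for s in subtasks]

def sequential_agent(task, plan_fn, max_steps=10):
    """Plan one subtask at a time, conditioning on the results so far."""
    history, results = [], []
    for _ in range(max_steps):
        step = plan_fn(task, history=history)
        if step is None:
            break
        results.append(execute(step))
        history.append((step, results[-1]))
    return results

def plan_stub(task, history):
    # stand-in for an LLM planner; history=None means "plan everything"
    plan = ["parse", "compute", "answer"]
    if history is None:
        return plan
    return plan[len(history)] if len(history) < len(plan) else None

r_one = one_step_agent("q", plan_stub)
r_seq = sequential_agent("q", plan_stub)
```

With a deterministic planner the two agents coincide; they diverge when intermediate tool outputs should change the remaining plan, which is the sequential agent's advantage.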
-
|
TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents (Oral) > link
With recent advancements in natural language processing, Large Language Models (LLMs) have emerged as powerful tools for various real-world applications. Despite their prowess, the intrinsic generative abilities of LLMs may prove insufficient for handling complex tasks which necessitate a combination of task planning and the usage of external tools. In this paper, we first propose a structured framework tailored for LLM-based AI Agents and discuss the crucial capabilities necessary for tackling intricate problems. Within this framework, we design two distinct types of agents (i.e., one-step agent and sequential agent) to execute the inference process. Subsequently, we instantiate the framework using various LLMs and evaluate their Task Planning and Tool Usage (TPTU) abilities on typical tasks. By highlighting key findings and challenges, our goal is to provide a helpful resource for researchers and practitioners to leverage the power of LLMs in their AI applications. Our study emphasizes the substantial potential of these models, while also identifying areas that need more investigation and improvement. |
Jingqing Ruan · YiHong Chen · Bin Zhang · Zhiwei Xu · Tianpeng Bao · du qing · shi shiwei · Hangyu Mao · Xingyu Zeng · Rui Zhao 🔗 |
-
|
$\mathcal{B}$-Coder: On Value-Based Deep Reinforcement Learning for Program Synthesis (Poster) > link
Program synthesis aims to create accurate, executable code from natural language descriptions. This field has leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. This integration focuses on directly optimizing functional correctness, transcending conventional supervised losses. While current literature predominantly favors policy-based algorithms, attributes of program synthesis suggest a natural compatibility with value-based methods. This stems from the rich collection of off-policy programs developed by human programmers, and the straightforward verification of generated programs through automated unit testing (i.e., easily obtainable rewards in RL language). Diverging from the predominant use of policy-based algorithms, our work explores the applicability of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we propose an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance compared with policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
|
Zishun Yu · Yunzhe Tao · Liyu Chen · Tao Sun · Hongxia Yang 🔗 |
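The value-based post-processing step mentioned in this abstract, using a learned value function to rank generated programs, can be sketched as follows. The candidates, the "unit test", and the value function below are toy stand-ins, not the paper's learned components:

```python
def postprocess_programs(candidates, value_fn, unit_tests):
    """Rank candidate programs: filter by unit tests when available,
    then break ties with a (learned, here stubbed) value function."""
    passing = [p for p in candidates if all(t(p) for t in unit_tests)]
    pool = passing or candidates  # fall back if nothing passes the tests
    return max(pool, key=value_fn)

# toy candidates: strings standing in for generated programs
candidates = ["return a - b", "return a + b", "return b + a"]
unit_tests = [lambda p: "+" in p]  # stand-in for actually executing tests
# stand-in for a learned value function scoring each program
value_stub = lambda p: -len(p) if not p.startswith("return b") else -100.0
best = postprocess_programs(candidates, value_stub, unit_tests)
```

Unit-test rewards give the hard filter; the value function supplies a ranking signal when tests alone cannot distinguish candidates.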
-
|
$\mathcal{B}$-Coder: On Value-Based Deep Reinforcement Learning for Program Synthesis (Oral) > link
Program synthesis aims to create accurate, executable code from natural language descriptions. This field has leveraged the power of reinforcement learning (RL) in conjunction with large language models (LLMs), significantly enhancing code generation capabilities. This integration focuses on directly optimizing functional correctness, transcending conventional supervised losses. While current literature predominantly favors policy-based algorithms, attributes of program synthesis suggest a natural compatibility with value-based methods. This stems from the rich collection of off-policy programs developed by human programmers, and the straightforward verification of generated programs through automated unit testing (i.e., easily obtainable rewards in RL language). Diverging from the predominant use of policy-based algorithms, our work explores the applicability of value-based approaches, leading to the development of our $\mathcal{B}$-Coder (pronounced Bellman coder). Yet, training value-based methods presents challenges due to the enormous search space inherent to program synthesis. To this end, we propose an initialization protocol for RL agents utilizing pre-trained LMs and a conservative Bellman operator to reduce training complexities. Moreover, we demonstrate how to leverage the learned value functions as a dual strategy to post-process generated programs. Our empirical evaluations demonstrated $\mathcal{B}$-Coder's capability in achieving state-of-the-art performance compared with policy-based methods. Remarkably, this achievement is reached with minimal reward engineering effort, highlighting the effectiveness of value-based RL, independent of reward designs.
|
Zishun Yu · Yunzhe Tao · Liyu Chen · Tao Sun · Hongxia Yang 🔗 |
-
|
TD-MPC2: Scalable, Robust World Models for Continuous Control (Poster) > link
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com |
Nicklas Hansen · Hao Su · Xiaolong Wang 🔗 |
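Local trajectory optimization in a latent space, the core loop named in this abstract, can be sketched with random-shooting planning over a toy latent model. TD-MPC2 itself uses a learned latent dynamics model and a more refined sampling-based planner; everything below is an illustrative stand-in:

```python
import numpy as np

def latent_mpc(z0, dynamics, reward, horizon=5, samples=256, rng=None):
    """Random-shooting trajectory optimization in a latent space: roll
    the (here: toy) dynamics forward under sampled action sequences and
    return the first action of the highest-return sequence."""
    rng = rng or np.random.default_rng(0)
    actions = rng.uniform(-1, 1, size=(samples, horizon))
    returns = np.zeros(samples)
    for i in range(samples):
        z = z0
        for a in actions[i]:
            z = dynamics(z, a)       # latent rollout, no decoding needed
            returns[i] += reward(z)  # reward predicted in latent space
    return actions[np.argmax(returns), 0]

dynamics = lambda z, a: 0.9 * z + a  # toy 1-D latent model
reward = lambda z: -abs(z - 1.0)     # drive the latent state toward 1.0
a0 = latent_mpc(0.0, dynamics, reward)
```

In MPC fashion, only the first action is executed; the optimization is re-run at the next step from the new latent state.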
-
|
TD-MPC2: Scalable, Robust World Models for Continuous Control (Oral) > link
TD-MPC is a model-based reinforcement learning (RL) algorithm that performs local trajectory optimization in the latent space of a learned implicit (decoder-free) world model. In this work, we present TD-MPC2: a series of improvements upon the TD-MPC algorithm. We demonstrate that TD-MPC2 improves significantly over baselines across 104 online RL tasks spanning 4 diverse task domains, achieving consistently strong results with a single set of hyperparameters. We further show that agent capabilities increase with model and data size, and successfully train a single 317M parameter agent to perform 80 tasks across multiple task domains, embodiments, and action spaces. We conclude with an account of lessons, opportunities, and risks associated with large TD-MPC2 agents. Explore videos, models, data, code, and more at https://tdmpc2.com |
Nicklas Hansen · Hao Su · Xiaolong Wang 🔗 |
-
|
Scaling Offline Q-Learning with Vision Transformers (Poster) > link
It has been shown that offline RL methods, such as conservative Q-learning (CQL), scale favorably for training generalist agents with a ResNet backbone. Recent vision and natural language processing research shows that transformer-based models scale more favorably compared to domain specific models with strong inductive biases (such as convolutional neural networks and recurrent neural networks). In this paper, we investigate how well visual transformers (ViTs) serve as backbones for CQL for training single-game agents. In this work, we enhance the Vision Transformer (ViT) for image-based RL by introducing spatio-temporal attention layers. We further investigate the impact of various embedding sequence aggregation methods on ViT performance and demonstrate that the prevalent mean pooling aggregation is suboptimal. Overall, our modified ViT outperforms the standard ViTs in the single-game Atari setting. |
Yingjie Miao · Jordi Orbay · Rishabh Agarwal · Aviral Kumar · George Tucker · Aleksandra Faust 🔗 |
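The aggregation question raised in this abstract, how to collapse a ViT's patch-token embeddings into one vector, can be illustrated by contrasting mean pooling with a learned-query attention pooling. This is a generic sketch of the two aggregation styles, not the paper's specific architecture:

```python
import numpy as np

def mean_pool(tokens):
    """Uniform average over patch-token embeddings (the common default)."""
    return tokens.mean(axis=0)

def attention_pool(tokens, query):
    """Attention pooling: weight tokens by their similarity to a
    (trainable, here random) query vector instead of averaging uniformly."""
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over tokens
    return weights @ tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 patch embeddings of dim 8
query = rng.standard_normal(8)         # stands in for a learned query
pooled_mean = mean_pool(tokens)
pooled_attn = attention_pool(tokens, query)
```

Because the attention weights are non-uniform, the pooled representation can emphasize task-relevant patches, which is one reason uniform mean pooling can be suboptimal.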
-
|
Scaling Offline Q-Learning with Vision Transformers (Oral) > link
It has been shown that offline RL methods, such as conservative Q-learning (CQL), scale favorably for training generalist agents with a ResNet backbone. Recent vision and natural language processing research shows that transformer-based models scale more favorably compared to domain specific models with strong inductive biases (such as convolutional neural networks and recurrent neural networks). In this paper, we investigate how well visual transformers (ViTs) serve as backbones for CQL for training single-game agents. In this work, we enhance the Vision Transformer (ViT) for image-based RL by introducing spatio-temporal attention layers. We further investigate the impact of various embedding sequence aggregation methods on ViT performance and demonstrate that the prevalent mean pooling aggregation is suboptimal. Overall, our modified ViT outperforms the standard ViTs in the single-game Atari setting. |
Yingjie Miao · Jordi Orbay · Rishabh Agarwal · Aviral Kumar · George Tucker · Aleksandra Faust 🔗 |
-
|
Identifying the Risks of LM Agents with an LM-Emulated Sandbox (Poster) > link
Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks—such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, setting up the environment for each test scenario manually, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tail risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes toolkits and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment. |
Yangjun Ruan · Honghua Dong · Andrew Wang · Silviu Pitis · Yongchao Zhou · Jimmy Ba · Yann Dubois · Chris Maddison · Tatsunori Hashimoto 🔗 |
-
|
Identifying the Risks of LM Agents with an LM-Emulated Sandbox (Oral) > link
Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks—such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, setting up the environment for each test scenario manually, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tail risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables scalable testing of LM agents against a diverse range of tools and scenarios. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes toolkits and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment. |
Yangjun Ruan · Honghua Dong · Andrew Wang · Silviu Pitis · Yongchao Zhou · Jimmy Ba · Yann Dubois · Chris Maddison · Tatsunori Hashimoto 🔗 |
-
|
Using Large Language Models for Hyperparameter Optimization (Poster) > link
This paper studies the use of foundational large language models (LLMs) to make decisions during hyperparameter optimization (HPO). Our primary objective is to understand the performance and limitations of using LLMs for HPO. We study the behavior of LLMs when optimizing simple 2-dimensional landscapes to evaluate the LLM search algorithm and chain-of-thought reasoning. We look at classical setups, with pre-specified search spaces over a small number of hyperparameters, showing an improved initial search phase compared to existing methods. Furthermore, we propose to treat the code specifying our model as a hyperparameter, which the LLM outputs, going beyond the capabilities of existing HPO methods. Our contributions shed light on a new application of using foundational LLMs on the traditional decision making problem of hyperparameter optimization. |
Michael Zhang · Nishkrit Desai · Juhan Bae · Jonathan Lorraine · Jimmy Ba 🔗 |
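The decision loop studied in this abstract can be sketched generically: a proposer (the LLM in the paper; a deterministic stub here) sees the full history of configurations and scores and suggests the next configuration to evaluate. The toy loss landscape and the "halve the best so far" heuristic below are illustrative assumptions:

```python
def hpo_loop(evaluate, propose, budget=10):
    """Generic HPO loop: `propose` sees the history of (config, score)
    pairs and suggests the next hyperparameter configuration to try."""
    history = []
    for _ in range(budget):
        config = propose(history)
        history.append((config, evaluate(config)))
    return min(history, key=lambda h: h[1])  # best (config, loss) pair

def evaluate(config):
    # toy 2-dimensional loss landscape with optimum at (0.1, 0.01)
    lr, wd = config
    return (lr - 0.1) ** 2 + (wd - 0.01) ** 2

def propose_stub(history):
    # stand-in for prompting an LLM with the search history
    if not history:
        return (1.0, 1.0)
    (lr, wd), _ = min(history, key=lambda h: h[1])
    return (lr * 0.5, wd * 0.5)  # "halve the best config so far"

best_config, best_loss = hpo_loop(evaluate, propose_stub)
```

In the paper the proposer is an LLM prompted with the serialized history, which also enables non-numeric "hyperparameters" such as model code.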
-
|
Using Large Language Models for Hyperparameter Optimization (Oral) > link
This paper studies the use of foundational large language models (LLMs) to make decisions during hyperparameter optimization (HPO). Our primary objective is to understand the performance and limitations of using LLMs for HPO. We study the behavior of LLMs when optimizing simple 2-dimensional landscapes to evaluate the LLM search algorithm and chain-of-thought reasoning. We look at classical setups, with pre-specified search spaces over a small number of hyperparameters, showing an improved initial search phase compared to existing methods. Furthermore, we propose to treat the code specifying our model as a hyperparameter, which the LLM outputs, going beyond the capabilities of existing HPO methods. Our contributions shed light on a new application of using foundational LLMs on the traditional decision making problem of hyperparameter optimization. |
Michael Zhang · Nishkrit Desai · Juhan Bae · Jonathan Lorraine · Jimmy Ba 🔗 |
-
|
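A minimal sketch of the optimization loop the abstract describes. The `llm_propose` stub, the toy `objective` landscape, and all hyperparameter names are illustrative assumptions, not the authors' implementation: in the paper, the proposal step is an actual LLM call whose prompt contains the search history.

```python
import random

def llm_propose(history):
    # Stand-in for an LLM call: given (config, loss) history, propose the next
    # hyperparameter configuration. A real system would serialize the history
    # into a prompt and parse the model's reply.
    if not history:
        return {"lr": 0.1, "depth": 3}
    best_cfg, _ = min(history, key=lambda h: h[1])
    # Perturb the best configuration seen so far.
    return {
        "lr": max(1e-4, best_cfg["lr"] * random.choice([0.5, 1.0, 2.0])),
        "depth": max(1, best_cfg["depth"] + random.choice([-1, 0, 1])),
    }

def objective(cfg):
    # Toy 2-dimensional validation-loss landscape (minimum near lr=0.01, depth=5).
    return (cfg["lr"] - 0.01) ** 2 + 0.01 * (cfg["depth"] - 5) ** 2

def hpo_loop(n_trials=20, seed=0):
    random.seed(seed)
    history = []
    for _ in range(n_trials):
        cfg = llm_propose(history)
        history.append((cfg, objective(cfg)))
    return min(history, key=lambda h: h[1])

best_cfg, best_loss = hpo_loop()
```

Swapping `llm_propose` for a prompted model is the whole idea; the loop itself is ordinary sequential model-based optimization.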
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
(
Poster
)
>
link
Real-world environments often have large-scale domains that make classical planning intractable. Large language models (LLMs) have been used as few-shot planning policies due to their commonsense knowledge for solving daily problems. However, the potential of LLMs for planning complex tasks remains largely untapped. This paper shows that LLMs can be used as both the commonsense world model and the heuristic policy in search algorithms such as Monte Carlo Tree Search (MCTS). The LLM's world model provides a commonsense prior belief over states for MCTS to achieve reasoned decision-making efficiently, while the LLM's heuristic policy guides the search toward relevant parts of the tree, substantially reducing the search complexity. We demonstrate the effectiveness of our method in daily task-planning experiments and highlight its advantages over using LLMs solely as policies. |
Zirui Zhao · Wee Sun Lee · David Hsu 🔗 |
-
|
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
(
Oral
)
>
link
Real-world environments often have large-scale domains that make classical planning intractable. Large language models (LLMs) have been used as few-shot planning policies due to their commonsense knowledge for solving daily problems. However, the potential of LLMs for planning complex tasks remains largely untapped. This paper shows that LLMs can be used as both the commonsense world model and the heuristic policy in search algorithms such as Monte Carlo Tree Search (MCTS). The LLM's world model provides a commonsense prior belief over states for MCTS to achieve reasoned decision-making efficiently, while the LLM's heuristic policy guides the search toward relevant parts of the tree, substantially reducing the search complexity. We demonstrate the effectiveness of our method in daily task-planning experiments and highlight its advantages over using LLMs solely as policies. |
Zirui Zhao · Wee Sun Lee · David Hsu 🔗 |
-
|
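A toy sketch of using an LLM-style prior inside tree search, in the spirit of the abstract. Everything here is an illustrative assumption: the two-step "get key, open door" task, the hard-coded `llm_policy_prior` (standing in for prompting an LLM), and the PUCT-style selection rule are simplifications, not the paper's method.

```python
import math

# Toy deterministic environment: the state is the tuple of actions taken so
# far; reward 1 if the sequence matches the goal plan at the horizon, else 0.
ACTIONS = ["find_key", "open_door", "wander"]
GOAL = ("find_key", "open_door")

def llm_policy_prior(state):
    # Stand-in for an LLM heuristic policy: commonsense says get the key
    # first, then open the door. A real system would prompt an LLM here.
    nxt = GOAL[len(state)] if len(state) < len(GOAL) else None
    return {a: (0.8 if a == nxt else 0.1) for a in ACTIONS}

class Node:
    def __init__(self, state):
        self.state, self.N, self.W = state, 0, 0.0
        self.children = {}

def puct_select(node, c=1.0):
    # PUCT: value estimate plus a prior-weighted exploration bonus.
    prior = llm_policy_prior(node.state)
    def score(a):
        child = node.children.get(a)
        n, q = (child.N, child.W / child.N) if child and child.N else (0, 0.0)
        return q + c * prior[a] * math.sqrt(node.N + 1) / (1 + n)
    return max(ACTIONS, key=score)

def simulate(root):
    path, node = [root], root
    while len(node.state) < len(GOAL):
        a = puct_select(node)
        node = node.children.setdefault(a, Node(node.state + (a,)))
        path.append(node)
    reward = 1.0 if node.state == GOAL else 0.0
    for n in path:           # backpropagate
        n.N += 1
        n.W += reward
    return reward

root = Node(())
for _ in range(200):
    simulate(root)
best_action = max(root.children, key=lambda a: root.children[a].N)
```

The prior keeps visits concentrated on plausible branches, which is how the abstract's "guides the search to relevant parts of the tree" cashes out mechanically.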
Linear diffusion models meet contextual bandits with large action spaces
(
Poster
)
>
link
Efficient exploration is a key challenge in contextual bandits due to the potentially large size of their action space, where uninformed exploration can result in computational and statistical inefficiencies. Fortunately, the rewards of actions are often correlated, and this can be leveraged to explore them efficiently. In this work, we capture such correlations using pre-trained linear diffusion models, upon which we design diffusion Thompson sampling (dTS). We develop both theoretical and algorithmic foundations for dTS, and empirical evaluation shows its favorable performance. |
Imad Aouali 🔗 |
-
|
Linear diffusion models meet contextual bandits with large action spaces
(
Oral
)
>
link
Efficient exploration is a key challenge in contextual bandits due to the potentially large size of their action space, where uninformed exploration can result in computational and statistical inefficiencies. Fortunately, the rewards of actions are often correlated, and this can be leveraged to explore them efficiently. In this work, we capture such correlations using pre-trained linear diffusion models, upon which we design diffusion Thompson sampling (dTS). We develop both theoretical and algorithmic foundations for dTS, and empirical evaluation shows its favorable performance. |
Imad Aouali 🔗 |
-
|
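A simplified sketch of the underlying idea: Thompson sampling whose prior comes from a pre-trained model rather than being uninformed. The `pretrained_prior` stub (standing in for the paper's diffusion model) and the independent Gaussian arms are assumptions that drop dTS's linear-diffusion structure; only the informed-prior TS skeleton is shown.

```python
import random

def pretrained_prior(k_arms):
    # Stand-in for a pre-trained (e.g. diffusion) model supplying an informed
    # prior over arm means; here just a fixed guess per arm.
    return [0.2 + 0.1 * k for k in range(k_arms)]

class GaussianTS:
    # Thompson sampling for a Gaussian bandit with known noise sigma and a
    # conjugate N(mu0, tau0^2) prior on each arm's mean.
    def __init__(self, mu0, tau0=1.0, sigma=0.5):
        self.mu, self.tau2 = list(mu0), [tau0 ** 2] * len(mu0)
        self.sigma2 = sigma ** 2

    def act(self):
        # Sample a mean from each arm's posterior; play the argmax.
        samples = [random.gauss(m, t2 ** 0.5) for m, t2 in zip(self.mu, self.tau2)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Standard Gaussian conjugate posterior update.
        prec = 1 / self.tau2[arm] + 1 / self.sigma2
        self.mu[arm] = (self.mu[arm] / self.tau2[arm] + reward / self.sigma2) / prec
        self.tau2[arm] = 1 / prec

random.seed(0)
true_means = [0.1, 0.3, 0.9]
agent = GaussianTS(pretrained_prior(3))
pulls = [0, 0, 0]
for _ in range(500):
    a = agent.act()
    agent.update(a, random.gauss(true_means[a], 0.5))
    pulls[a] += 1
```

dTS replaces the per-arm prior with a learned model that couples arms, so that observing one arm's reward sharpens beliefs about correlated arms.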
Policy-Gradient Training of Language Models for Ranking
(
Poster
)
>
link
Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating an LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks. |
Ge Gao · Jonathan Chang · Claire Cardie · Kianté Brantley · Thorsten Joachims 🔗 |
-
|
Policy-Gradient Training of Language Models for Ranking
(
Oral
)
>
link
Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating an LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks. |
Ge Gao · Jonathan Chang · Claire Cardie · Kianté Brantley · Thorsten Joachims 🔗 |
-
|
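A rough sketch of the core machinery: sampling rankings from a Plackett-Luce policy and taking a REINFORCE step on a ranking utility. The three-document setup, the reciprocal-rank utility, and tabular scores in place of an LLM's document scores are illustrative assumptions; the paper's policy is an LLM scoring real documents.

```python
import math, random

def sample_ranking(scores):
    # Sample a ranking from the Plackett-Luce distribution induced by `scores`
    # (sequential softmax sampling without replacement); return it with log-prob.
    remaining = list(range(len(scores)))
    ranking, logp = [], 0.0
    while remaining:
        weights = [math.exp(scores[i]) for i in remaining]
        total = sum(weights)
        r = random.random() * total
        acc = 0.0
        for j, w in zip(remaining, weights):
            acc += w
            if r <= acc:
                ranking.append(j)
                logp += math.log(w / total)
                remaining.remove(j)
                break
    return ranking, logp

def reinforce_step(scores, utility, lr=0.5, n_samples=32):
    # One policy-gradient step: increase log-prob of high-utility rankings.
    grads = [0.0] * len(scores)
    for _ in range(n_samples):
        ranking, _ = sample_ranking(scores)
        u = utility(ranking)
        # Score-function gradient of the PL log-likelihood, stage by stage:
        # indicator(chosen) minus the softmax over remaining items.
        remaining = list(range(len(scores)))
        for chosen in ranking:
            weights = [math.exp(scores[i]) for i in remaining]
            total = sum(weights)
            for i, w in zip(remaining, weights):
                grads[i] += u * ((1.0 if i == chosen else 0.0) - w / total)
            remaining.remove(chosen)
    return [s + lr * g / n_samples for s, g in zip(scores, grads)]

random.seed(0)
# Utility: reciprocal rank of the single relevant document (index 2).
utility = lambda ranking: 1.0 / (ranking.index(2) + 1)
scores = [0.0, 0.0, 0.0]
for _ in range(50):
    scores = reinforce_step(scores, utility)
```

Because the utility can be any downstream metric (MRR, nDCG, answer quality), the training objective aligns with evaluation by construction, which is the abstract's central point.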
Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
(
Poster
)
>
link
Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically well-understood. In particular, it is unclear which reinforcement-learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods --- algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove the supervised-pretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories. |
Licong Lin · Yu Bai · Song Mei 🔗 |
-
|
Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
(
Oral
)
>
link
Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable in-context reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically well-understood. In particular, it is unclear which reinforcement-learning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods --- algorithm distillation and decision-pretrained transformers. First, assuming model realizability, we prove the supervised-pretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate near-optimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCB-VI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories. |
Licong Lin · Yu Bai · Song Mei 🔗 |
-
|
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
(
Poster
)
>
link
Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. This stands in contrast to most agents trained with reinforcement learning (RL), which typically learn behaviors from scratch. Therefore, we would like to endow RL agents with a similar ability to leverage contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of \textit{promptable representations}: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex RL tasks in Minecraft. We find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features. Moreover, we show that promptable representations extracted from general-purpose VLMs outperform both domain-specific representations and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important in yielding good representations for RL. Finally, we give a simple method for evaluating and optimizing prompts used by our approach for a given task without running expensive RL trials, ensuring that it extracts task-relevant semantic features from the VLM. |
William Chen · Oier Mees · Aviral Kumar · Sergey Levine 🔗 |
-
|
Vision-Language Models Provide Promptable Representations for Reinforcement Learning
(
Oral
)
>
link
Intelligent beings have the ability to quickly learn new behaviors and tasks by leveraging background world knowledge. This stands in contrast to most agents trained with reinforcement learning (RL), which typically learn behaviors from scratch. Therefore, we would like to endow RL agents with a similar ability to leverage contextual prior information. To this end, we propose a novel approach that uses the vast amounts of general-purpose, diverse, and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data to generate text in response to images and prompts. We initialize RL policies with VLMs by using such models as sources of \textit{promptable representations}: embeddings that are grounded in visual observations and encode semantic features based on the VLM's internal knowledge, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex RL tasks in Minecraft. We find that policies trained on promptable embeddings significantly outperform equivalent policies trained on generic, non-promptable image encoder features. Moreover, we show that promptable representations extracted from general-purpose VLMs outperform both domain-specific representations and instruction-following methods. In ablations, we find that VLM promptability and text generation are both important in yielding good representations for RL. Finally, we give a simple method for evaluating and optimizing prompts used by our approach for a given task without running expensive RL trials, ensuring that it extracts task-relevant semantic features from the VLM. |
William Chen · Oier Mees · Aviral Kumar · Sergey Levine 🔗 |
-
|
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
(
Poster
)
>
link
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt. |
Martin Klissarov · Pierluca D'Oro · Shagun Sodhani · Roberta Raileanu · Pierre-Luc Bacon · Pascal Vincent · Amy Zhang · Mikael Henaff 🔗 |
-
|
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
(
Oral
)
>
link
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt. |
Martin Klissarov · Pierluca D'Oro · Shagun Sodhani · Roberta Raileanu · Pierre-Luc Bacon · Pascal Vincent · Amy Zhang · Mikael Henaff 🔗 |
-
|
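A minimal sketch of Motif's reward-learning step: fit a Bradley-Terry reward model to pairwise preferences elicited from an LLM over event captions. The `llm_prefers` stub, the tiny vocabulary, and the bag-of-words featurizer are all illustrative assumptions; the paper uses an actual LLM judging NetHack message pairs and a neural reward model.

```python
import math

def llm_prefers(cap_a, cap_b):
    # Stand-in for querying an LLM: prefer captions describing progress.
    score = lambda c: ("descend" in c) + ("gold" in c)
    return score(cap_a) > score(cap_b)

# Featurize captions as bag-of-words over a tiny vocabulary (illustrative).
VOCAB = ["descend", "gold", "wait", "eat"]
feat = lambda cap: [1.0 if w in cap else 0.0 for w in VOCAB]

def train_reward(pairs, lr=0.5, epochs=200):
    # Fit a linear Bradley-Terry reward r(x) = w . x: for each pair where a is
    # preferred over b, maximize log sigmoid(r(a) - r(b)) by gradient ascent.
    w = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for a, b in pairs:
            fa, fb = feat(a), feat(b)
            diff = sum(wi * (xa - xb) for wi, xa, xb in zip(w, fa, fb))
            g = 1.0 - 1.0 / (1.0 + math.exp(-diff))  # 1 - sigmoid(diff)
            w = [wi + lr * g * (xa - xb) for wi, xa, xb in zip(w, fa, fb)]
    return w

captions = ["you descend the stairs", "you find gold", "you wait", "you eat"]
pairs = [(a, b) for a in captions for b in captions
         if a != b and llm_prefers(a, b)]

w = train_reward(pairs)
reward = lambda cap: sum(wi * x for wi, x in zip(w, feat(cap)))
```

The learned `reward` then serves as the intrinsic reward for a standard RL algorithm; the agent never queries the LLM during environment interaction.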
Eureka: Human-Level Reward Design via Coding Large Language Models
(
Poster
)
>
link
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate, for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed. |
Jason Ma · William Liang · Guanzhi Wang · De-An Huang · Osbert Bastani · Dinesh Jayaraman · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
-
|
Eureka: Human-Level Reward Design via Coding Large Language Models
(
Oral
)
>
link
Large Language Models (LLMs) have excelled as high-level semantic planners for sequential decision-making tasks. However, harnessing them to learn complex low-level manipulation tasks, such as dexterous pen spinning, remains an open problem. We bridge this fundamental gap and present Eureka, a human-level reward design algorithm powered by LLMs. Eureka exploits the remarkable zero-shot generation, code-writing, and in-context improvement capabilities of state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over reward code. The resulting rewards can then be used to acquire complex skills via reinforcement learning. Without any task-specific prompting or pre-defined reward templates, Eureka generates reward functions that outperform expert human-engineered rewards. In a diverse suite of 29 open-source RL environments that include 10 distinct robot morphologies, Eureka outperforms human experts on 83% of the tasks, leading to an average normalized improvement of 52%. The generality of Eureka also enables a new gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF), readily incorporating human inputs to improve the quality and the safety of the generated rewards without model updating. Finally, using Eureka rewards in a curriculum learning setting, we demonstrate, for the first time, a simulated Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a pen in circles at rapid speed. |
Jason Ma · William Liang · Guanzhi Wang · De-An Huang · Osbert Bastani · Dinesh Jayaraman · Yuke Zhu · Linxi Fan · Animashree Anandkumar 🔗 |
-
|
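A heavily simplified sketch of the propose-evaluate-refine loop in the abstract. As assumptions: `llm_mutate` stands in for an LLM rewriting reward code (here it only perturbs a numeric constant in a fixed template), and `evaluate` stands in for a full RL training run scoring the candidate reward; neither reflects Eureka's actual prompts or evaluation.

```python
import random

def evaluate(reward_fn):
    # Stand-in for an RL training run: score how well a candidate shaping
    # reward matches a hidden target reward on sampled states.
    target = lambda s: -abs(s - 3.0)
    states = [i * 0.5 for i in range(12)]
    return -sum((reward_fn(s) - target(s)) ** 2 for s in states)

def llm_mutate(best_c):
    # Stand-in for an LLM proposing a revised reward given feedback; here we
    # just perturb the constant inside the template r(s) = -|s - c|.
    return best_c + random.gauss(0.0, 0.5)

def eureka_loop(generations=30, pop=8, seed=0):
    random.seed(seed)
    best_c = 0.0
    best_score = evaluate(lambda s: -abs(s - best_c))
    for _ in range(generations):
        for _ in range(pop):
            c = llm_mutate(best_c)
            score = evaluate(lambda s, c=c: -abs(s - c))
            if score > best_score:   # keep the best candidate so far
                best_c, best_score = c, score
    return best_c
```

Eureka's distinctive ingredients, free-form code generation and textual feedback on training curves, are exactly what the stubs elide.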
Language Conditioned Semantic Search Based Policy for Robotic Manipulation Tasks
(
Poster
)
>
link
Solving various robotic manipulation tasks intelligently is a topic of great interest. Traditional reinforcement learning and imitation learning approaches require learning policies via complex strategies that are difficult to generalize. In this work, we propose a language-conditioned, semantic-search-based method over an available demonstration dataset of state-action trajectories to produce an on-the-fly search-based policy. Here we directly acquire actions from the most similar manipulation trajectories found in the dataset. Our approach surpasses the performance of previous best methods on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities. This holds great potential for expanding the use of our on-the-fly search-based policy approach to tasks typically addressed by imitation learning or reinforcement learning-based policies. |
Jannik Sheikh · Andrew Melnik · Gora Chand Nandi · Robert Haschke 🔗 |
-
|
Language Conditioned Semantic Search Based Policy for Robotic Manipulation Tasks
(
Oral
)
>
link
Solving various robotic manipulation tasks intelligently is a topic of great interest. Traditional reinforcement learning and imitation learning approaches require learning policies via complex strategies that are difficult to generalize. In this work, we propose a language-conditioned, semantic-search-based method over an available demonstration dataset of state-action trajectories to produce an on-the-fly search-based policy. Here we directly acquire actions from the most similar manipulation trajectories found in the dataset. Our approach surpasses the performance of previous best methods on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities. This holds great potential for expanding the use of our on-the-fly search-based policy approach to tasks typically addressed by imitation learning or reinforcement learning-based policies. |
Jannik Sheikh · Andrew Melnik · Gora Chand Nandi · Robert Haschke 🔗 |
-
|
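The retrieval idea reduces to nearest-neighbor search over demonstrations. This sketch assumes a bag-of-words `embed` stub in place of a real language/state encoder, and hypothetical instruction/action strings; the paper operates on learned embeddings of CALVIN trajectories.

```python
import math

def embed(text):
    # Stand-in for a pretrained text/state encoder: bag-of-words counts.
    counts = {}
    for w in text.lower().split():
        counts[w] = counts.get(w, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Demonstration dataset: (instruction + state description, recorded action).
dataset = [
    ("pick up the red block from the table", "grasp_red_block"),
    ("open the drawer on the left", "pull_drawer_handle"),
    ("push the blue block to the right", "push_blue_block"),
]
index = [(embed(desc), action) for desc, action in dataset]

def search_policy(instruction):
    # On-the-fly policy: act as in the most similar stored demonstration.
    query = embed(instruction)
    _, action = max(index, key=lambda e: cosine(query, e[0]))
    return action
```

No policy is trained; adding a demonstration to `dataset` immediately extends the policy, which is where the zero-shot adaptation claim comes from.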
NexusRaven: A Commercially-Permissive Language Model for Function Calling
(
Poster
)
>
link
The rise of open-source, commercially permissive large language models (LLMs) is revolutionizing generative AI, presenting organizations with enhanced control, minimized data risks, and cost benefits compared to proprietary models. However, in the field of tool use and function-calling LLMs, many open-source models, such as Gorilla and ToolLLAMA, are dependent on proprietary LLMs like GPT-4 for high-quality training data, which often faces legal restrictions for competitive commercial applications. In this paper, we introduce NexusRaven-13B, an open-source LLM designed for function calls. Originating from the CodeLLAMA-13B lineage, NexusRaven-13B employs a unique data curation via multi-step refinement, ensuring high-quality training data without relying on GPT-4 distillation. NexusRaven-13B matches GPT-3.5 in zero-shot function-calling accuracy. When combined with our second core technique, demonstration retrieval augmentation, its performance significantly surpasses GPT-4. The code, model, and demo will be available after the review process. |
Venkat Krishna Srinivasan · Zhen Dong · Banghua Zhu · Brian Yu · Hanzi Mao · Damon Mosk-Aoyama · Kurt Keutzer · Jiantao Jiao · Jian Zhang 🔗 |
-
|
NexusRaven: A Commercially-Permissive Language Model for Function Calling
(
Oral
)
>
link
The rise of open-source, commercially permissive large language models (LLMs) is revolutionizing generative AI, presenting organizations with enhanced control, minimized data risks, and cost benefits compared to proprietary models. However, in the field of tool use and function-calling LLMs, many open-source models, such as Gorilla and ToolLLAMA, are dependent on proprietary LLMs like GPT-4 for high-quality training data, which often faces legal restrictions for competitive commercial applications. In this paper, we introduce NexusRaven-13B, an open-source LLM designed for function calls. Originating from the CodeLLAMA-13B lineage, NexusRaven-13B employs a unique data curation via multi-step refinement, ensuring high-quality training data without relying on GPT-4 distillation. NexusRaven-13B matches GPT-3.5 in zero-shot function-calling accuracy. When combined with our second core technique, demonstration retrieval augmentation, its performance significantly surpasses GPT-4. The code, model, and demo will be available after the review process. |
Venkat Krishna Srinivasan · Zhen Dong · Banghua Zhu · Brian Yu · Hanzi Mao · Damon Mosk-Aoyama · Kurt Keutzer · Jiantao Jiao · Jian Zhang 🔗 |
-
|
Mitigating Generative Agent Social Dilemmas
(
Poster
)
>
link
In social dilemmas, individuals would be better off cooperating but fail to do so due to conflicting interests that discourage cooperation. Existing work on social dilemmas in AI has focused on standard agent design paradigms, most recently in the context of multi-agent reinforcement learning (MARL). However, with the rise of large language models (LLMs), a new design paradigm for AI systems has started to emerge---generative agents, in which actions performed by agents are chosen by prompting LLMs. This paradigm has seen recent success, such as Voyager, a highly capable Minecraft agent. In this work, we perform an initial study of outcomes that arise when deploying generative agents in social dilemmas. To do this, we build a multi-agent Voyager framework with a contracting and judgement mechanism based on formal contracting, which has been effective in mitigating social dilemmas in MARL. We then construct social dilemmas in Minecraft as the testbed for our open-source framework. Finally, we conduct preliminary experiments using our framework to provide evidence that contracting helps improve outcomes for generative agents in social dilemmas. |
Julian Yocum · Phillip Christoffersen · Mehul Damani · Justin Svegliato · Dylan Hadfield-Menell · Stuart J Russell 🔗 |
-
|
Mitigating Generative Agent Social Dilemmas
(
Oral
)
>
link
In social dilemmas, individuals would be better off cooperating but fail to do so due to conflicting interests that discourage cooperation. Existing work on social dilemmas in AI has focused on standard agent design paradigms, most recently in the context of multi-agent reinforcement learning (MARL). However, with the rise of large language models (LLMs), a new design paradigm for AI systems has started to emerge---generative agents, in which actions performed by agents are chosen by prompting LLMs. This paradigm has seen recent success, such as Voyager, a highly capable Minecraft agent. In this work, we perform an initial study of outcomes that arise when deploying generative agents in social dilemmas. To do this, we build a multi-agent Voyager framework with a contracting and judgement mechanism based on formal contracting, which has been effective in mitigating social dilemmas in MARL. We then construct social dilemmas in Minecraft as the testbed for our open-source framework. Finally, we conduct preliminary experiments using our framework to provide evidence that contracting helps improve outcomes for generative agents in social dilemmas. |
Julian Yocum · Phillip Christoffersen · Mehul Damani · Justin Svegliato · Dylan Hadfield-Menell · Stuart J Russell 🔗 |
-
|
In-Context Multi-Armed Bandits via Supervised Pretraining
(
Poster
)
>
link
Exploring the in-context learning capabilities of large transformer models, this research focuses on decision-making within reinforcement learning (RL) environments, specifically multi-armed bandit problems. We introduce the Reward Weighted Decision-Pretrained Transformer (DPT-RW), a model that uses straightforward supervised pretraining with a reward-weighted imitation learning loss. The DPT-RW predicts optimal actions by evaluating a query state and an in-context dataset across varied tasks. Surprisingly, this simple approach produces a model capable of solving a wide range of RL problems in-context, demonstrating online exploration and offline conservatism without specific training in these areas. Notably, the model performs optimally in the online setting despite being trained on data generated by suboptimal policies, without access to optimal data. |
Fred Zhang · Jiaxin Ye · Zhuoran Yang 🔗 |
-
|
In-Context Multi-Armed Bandits via Supervised Pretraining
(
Oral
)
>
link
Exploring the in-context learning capabilities of large transformer models, this research focuses on decision-making within reinforcement learning (RL) environments, specifically multi-armed bandit problems. We introduce the Reward Weighted Decision-Pretrained Transformer (DPT-RW), a model that uses straightforward supervised pretraining with a reward-weighted imitation learning loss. The DPT-RW predicts optimal actions by evaluating a query state and an in-context dataset across varied tasks. Surprisingly, this simple approach produces a model capable of solving a wide range of RL problems in-context, demonstrating online exploration and offline conservatism without specific training in these areas. Notably, the model performs optimally in the online setting despite being trained on data generated by suboptimal policies, without access to optimal data. |
Fred Zhang · Jiaxin Ye · Zhuoran Yang 🔗 |
-
|
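A toy sketch of the reward-weighted imitation loss itself, L = -sum_i r_i log pi(a_i), shown on a tabular softmax policy rather than a transformer (an assumption; the paper trains a sequence model conditioned on an in-context dataset).

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward_weighted_step(logits, data, lr=0.1):
    # One gradient step on the reward-weighted imitation loss
    #   L = -sum_i r_i * log pi(a_i),
    # which tilts the policy toward actions that earned high reward.
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for action, r in data:
        for k in range(len(logits)):
            grads[k] += r * ((1.0 if k == action else 0.0) - probs[k])
    return [l + lr * g / len(data) for l, g in zip(logits, grads)]

# In-context dataset of (action, observed reward) from a suboptimal policy.
data = [(0, 0.2), (1, 0.1), (2, 1.0), (2, 0.9), (1, 0.2)]
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reward_weighted_step(logits, data)
policy = softmax(logits)
```

Weighting imitation by reward lets the learner upweight good actions without ever seeing an optimal-action label, matching the abstract's claim about training on suboptimal data.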
GROOT: Learning to Follow Instructions by Watching Gameplay Videos
(
Poster
)
>
link
We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. |
Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian (Shawn) Ma · Anji Liu · Yitao Liang 🔗 |
-
|
GROOT: Learning to Follow Instructions by Watching Gameplay Videos
(
Oral
)
>
link
We study the problem of building a controller that can follow open-ended instructions in open-world environments. We propose to follow reference videos as instructions, which offer expressive goal specifications while eliminating the need for expensive text-gameplay annotations. A new learning framework is derived to allow learning such instruction-following controllers from gameplay videos while producing a video instruction encoder that induces a structured goal space. We implement our agent GROOT in a simple yet effective encoder-decoder architecture based on causal transformers. We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft SkillForge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. |
Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian (Shawn) Ma · Anji Liu · Yitao Liang 🔗 |
Asking Clarifying Questions using Language Models and Probabilistic Reasoning (Poster)
The ability to ask good, informative clarification questions is crucial for any AI system that receives natural-language input from human users to be robust and reliable. Previous approaches to learning such interactive inference systems require costly data collection and generalize poorly across tasks. While recent large language models (LLMs) might seem able to solve both problems with their impressive zero-shot learning ability, they turn out to be weak at asking good questions. We introduce an inference-time algorithm that helps LLMs output more informative questions. The algorithm relies on probability distributions defined by prompting an LLM and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm is superior to baselines built on the same underlying LLM.
Top Piriyakulkij · Volodymyr Kuleshov · Kevin Ellis
Asking Clarifying Questions using Language Models and Probabilistic Reasoning (Oral)
The ability to ask good, informative clarification questions is crucial for any AI system that receives natural-language input from human users to be robust and reliable. Previous approaches to learning such interactive inference systems require costly data collection and generalize poorly across tasks. While recent large language models (LLMs) might seem able to solve both problems with their impressive zero-shot learning ability, they turn out to be weak at asking good questions. We introduce an inference-time algorithm that helps LLMs output more informative questions. The algorithm relies on probability distributions defined by prompting an LLM and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm is superior to baselines built on the same underlying LLM.
Top Piriyakulkij · Volodymyr Kuleshov · Kevin Ellis
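The expected-entropy criterion the abstract describes can be sketched in a toy Bayesian form. Everything below is illustrative: the function names, the explicit hypothesis enumeration, and the `answer_model` interface are assumptions for exposition (the paper defines these distributions by prompting an LLM rather than enumerating them).

```python
import math

def entropy(dist):
    """Shannon entropy of a discrete distribution {hypothesis: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def expected_entropy_after(question, prior, answer_model):
    """Expected posterior entropy over hypotheses after asking `question`.

    `answer_model(question, hypothesis)` returns {answer: prob}; in the
    paper's setting such distributions come from prompting an LLM.
    """
    # Marginal probability of each answer under the prior.
    answer_probs = {}
    for h, p_h in prior.items():
        for a, p_a in answer_model(question, h).items():
            answer_probs[a] = answer_probs.get(a, 0.0) + p_h * p_a
    expected = 0.0
    for a, p_a in answer_probs.items():
        if p_a == 0:
            continue
        # Bayesian posterior over hypotheses given answer a.
        posterior = {h: prior[h] * answer_model(question, h).get(a, 0.0) / p_a
                     for h in prior}
        expected += p_a * entropy(posterior)
    return expected

def best_question(questions, prior, answer_model):
    """Pick the question minimizing expected posterior entropy."""
    return min(questions,
               key=lambda q: expected_entropy_after(q, prior, answer_model))
```

A perfectly discriminative question drives the expected posterior entropy to zero, so `best_question` prefers it over an uninformative one.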
Pre-Training and Fine-Tuning Generative Flow Networks (Poster)
Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, because they are typically trained from a given extrinsic reward function, how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks remains an important open challenge. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. Framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcome, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model allows direct extraction of a policy capable of sampling from any new reward function in downstream tasks. However, adapting OC-GFN to a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor, enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.
Ling Pan · Moksh Jain · Kanika Madan · Yoshua Bengio
Pre-Training and Fine-Tuning Generative Flow Networks (Oral)
Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects from a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, because they are typically trained from a given extrinsic reward function, how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks remains an important open challenge. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. Framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcome, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model allows direct extraction of a policy capable of sampling from any new reward function in downstream tasks. However, adapting OC-GFN to a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor, enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.
Ling Pan · Moksh Jain · Kanika Madan · Yoshua Bengio
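In a toy setting with an enumerable outcome space, the "direct extraction" idea in the abstract can be sketched as reward-proportional outcome sampling followed by a rollout of the pre-trained outcome-conditioned sampler. The names `oc_policy` and the explicit enumeration are illustrative stand-ins, not the paper's implementation; the amortized predictor exists precisely because this marginalization is intractable in general.

```python
import random

def sample_for_new_reward(reward_fn, outcomes, oc_policy, rng=None):
    """Draw an outcome y with probability proportional to the new reward,
    then let the pre-trained outcome-conditioned sampler generate an
    object achieving y. Enumerating `outcomes` is only feasible in toy
    problems; OC-GFN amortizes this step instead.
    """
    rng = rng or random.Random(0)
    weights = [reward_fn(y) for y in outcomes]
    y = rng.choices(outcomes, weights=weights, k=1)[0]
    return oc_policy(y)
```

Because the outcome is drawn in proportion to the new reward, the composed sampler targets the new reward distribution without any retraining of the goal-reaching policy.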
Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View. (Poster)
Recent years have seen a drastic shift of focus toward large models trained with simple self-supervised objectives on diverse datasets. These foundation models have become ubiquitous in NLP and vision because they are generally applicable to many downstream tasks. These success stories have sparked various attempts to train similar models for RL problems. The Decision Transformer (DT) is one popular approach that treats the RL problem as a sequence modeling problem and uses a transformer model to predict actions. This algorithmic choice, though simple, has certain limitations compared to traditional RL algorithms. In this paper, we study one such limitation: the capability of recombining pieces of previously seen experience to solve a task never seen during training. We study this question in the setting of goal-reaching problems. We formalize this desirable property as a form of \emph{stitching} generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen \emph{together} in the training data. Our analysis shows that this sort of generalization is different from \emph{i.i.d.} generalization. This connection between stitching and generalization reveals why we should not expect existing DT-like methods to perform stitching, even in the limit of large datasets and models. We experimentally validate this result on carefully constructed datasets. The connection also suggests a simple remedy, the same remedy that improves generalization in supervised learning: data augmentation. We propose a naive \emph{temporal} data augmentation approach and demonstrate that adding it to RL methods based on SL enables them to stitch together experience, so that they succeed in navigating between states and goals unseen together during training.
Raj Ghugare · Matthieu Geist · Glen Berseth · Benjamin Eysenbach
Closing the Gap between TD Learning and Supervised Learning -- A Generalisation Point of View. (Oral)
Recent years have seen a drastic shift of focus toward large models trained with simple self-supervised objectives on diverse datasets. These foundation models have become ubiquitous in NLP and vision because they are generally applicable to many downstream tasks. These success stories have sparked various attempts to train similar models for RL problems. The Decision Transformer (DT) is one popular approach that treats the RL problem as a sequence modeling problem and uses a transformer model to predict actions. This algorithmic choice, though simple, has certain limitations compared to traditional RL algorithms. In this paper, we study one such limitation: the capability of recombining pieces of previously seen experience to solve a task never seen during training. We study this question in the setting of goal-reaching problems. We formalize this desirable property as a form of \emph{stitching} generalization: after training on a distribution of (state, goal) pairs, one would like to evaluate on (state, goal) pairs not seen \emph{together} in the training data. Our analysis shows that this sort of generalization is different from \emph{i.i.d.} generalization. This connection between stitching and generalization reveals why we should not expect existing DT-like methods to perform stitching, even in the limit of large datasets and models. We experimentally validate this result on carefully constructed datasets. The connection also suggests a simple remedy, the same remedy that improves generalization in supervised learning: data augmentation. We propose a naive \emph{temporal} data augmentation approach and demonstrate that adding it to RL methods based on SL enables them to stitch together experience, so that they succeed in navigating between states and goals unseen together during training.
Raj Ghugare · Matthieu Geist · Glen Berseth · Benjamin Eysenbach
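A minimal version of temporal augmentation along these lines: whenever two logged trajectories pass through a common state, states before the intersection in one trajectory can be paired with goals after it in the other, since a valid path exists through the shared state. This is a sketch of the general idea under simplified assumptions (discrete, hashable states; first-visit indexing), not necessarily the paper's exact scheme.

```python
def temporal_augment(trajectories):
    """Create cross-trajectory (state, goal) training pairs.

    If trajectory A visits state s and trajectory B also visits s, any
    state up to s in A is paired with any goal from s onward in B: the
    concatenated path A->s->B witnesses reachability. States are assumed
    hashable; `index` uses the first visit to the shared state.
    """
    pairs = []
    for a in trajectories:
        for b in trajectories:
            for s in set(a) & set(b):
                i, j = a.index(s), b.index(s)
                for state in a[:i + 1]:
                    for goal in b[j:]:
                        pairs.append((state, goal))
    return pairs
```

Training a supervised goal-conditioned policy on these augmented pairs exposes it to (state, goal) combinations never seen together in any single trajectory, which is exactly the stitching generalization the abstract targets.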
Confronting Reward Model Overoptimization with Constrained RLHF (Poster)
Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study of overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach that addresses this issue using constrained reinforcement learning to prevent the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
Ted Moskovitz · Aaditya Singh · DJ Strouse · Tuomas Sandholm · Russ Salakhutdinov · Anca Dragan · Stephen McAleer
Confronting Reward Model Overoptimization with Constrained RLHF (Oral)
Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study of overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach that addresses this issue using constrained reinforcement learning to prevent the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
Ted Moskovitz · Aaditya Singh · DJ Strouse · Tuomas Sandholm · Russ Salakhutdinov · Anca Dragan · Stephen McAleer
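The dynamic weighting via Lagrange multipliers can be sketched with a standard projected dual ascent step. The threshold semantics and update rule below are the generic constrained-RL pattern, not necessarily the paper's exact formulation; function names and the scalar `task_reward` are illustrative.

```python
def penalized_reward(task_reward, component_rewards, thresholds, lambdas):
    """Reward seen by the policy: each proxy RM is penalized for the amount
    by which it exceeds its usefulness threshold, weighted by its current
    Lagrange multiplier."""
    penalty = sum(lam * (r - tau)
                  for r, tau, lam in zip(component_rewards, thresholds, lambdas))
    return task_reward - penalty

def dual_update(lambdas, avg_rewards, thresholds, lr=0.1):
    """Projected gradient step on the dual variables: a multiplier grows
    while its component reward exceeds the threshold and decays toward
    zero (never below it) otherwise."""
    return [max(0.0, lam + lr * (r - tau))
            for lam, r, tau in zip(lambdas, avg_rewards, thresholds)]
```

Alternating policy updates on `penalized_reward` with `dual_update` steps drives each multiplier toward a value that holds its RM near the threshold, which is how the weighting becomes dynamic rather than hand-tuned.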