Workshop
Foundation Models for Decision Making
Mengjiao (Sherry) Yang · Yilun Du · Jack Parker-Holder · Siddharth Karamcheti · Igor Mordatch · Shixiang (Shane) Gu · Ofir Nachum

Sat Dec 03 06:50 AM -- 02:30 PM (PST) @ Room 291 - 292

Humans acquire vision, language, and decision making abilities through years of experience, arguably corresponding to millions of video frames, audio clips, and interactions with the world. Following this data-driven approach, recent foundation models trained on large and diverse datasets have demonstrated emergent capabilities and fast adaptation to a wide range of downstream vision and language tasks (e.g., BERT, DALL-E, GPT-3, CLIP). Meanwhile, in the decision making and reinforcement learning (RL) literature, foundation models have yet to fundamentally shift the traditional paradigm in which an agent learns from its own or others’ collected experience, typically on a single task and with limited prior knowledge. Nevertheless, there has been a growing body of foundation-model-inspired research in decision making that often involves collecting large amounts of interactive data for self-supervised learning at scale. For instance, foundation models such as BERT and GPT-3 have been applied to modeling trajectory sequences of agent experience, and ever-larger datasets have been curated for learning multimodal, multitask, and generalist agents. These works demonstrate the potential benefits of foundation models on a broad set of decision making applications such as autonomous driving, healthcare systems, robotics, goal-oriented dialogue, and recommendation systems.

Despite early signs of success, foundation models for decision making remain largely underexplored and underutilized, and they lack solid empirical and theoretical grounding. The challenges faced by existing research are as follows:
1. Many traditional decision making benchmarks are (near-)Markovian (i.e., historyless), and this brings the value of sequence modeling into question. The true power of foundation models may require more complex tasks.
2. Decision making tasks are composed of multi-modal data. At minimum, the states (observations), actions, and rewards of a task are each of different types. Moreover, across different tasks, states and actions can be highly distinct (image vs. text observations, discrete vs. continuous actions).
3. Unlike models in vision and language, decision making agents can further interact with the environment to collect additional experience in conjunction with learning on existing data. How such an interactive component should be integrated with foundation models is not clear.
4. There already exists a large gap between theory and practice in decision making. Hastily applying large models to decision making might create an even greater gap.
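
To make challenges 1 and 2 concrete, the following is a minimal sketch of how a trajectory whose states, actions, and rewards have different types might be flattened into one sequence for a sequence model. It assumes a return-conditioned layout in the style of Decision Transformer, which the description above does not prescribe; the names `Step`, `returns_to_go`, and `flatten_trajectory` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union

# A token is a (modality tag, value) pair. Real systems would embed each
# modality with its own encoder; the tag stands in for that choice here.
Token = Tuple[str, Union[float, int, Tuple[float, ...]]]


@dataclass
class Step:
    """One environment transition (hypothetical container for illustration)."""
    state: Tuple[float, ...]  # continuous observation vector
    action: int               # discrete action index
    reward: float             # scalar reward


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums of rewards: rtg[t] = r[t] + r[t+1] + ... + r[T-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def flatten_trajectory(steps: Sequence[Step]) -> List[Token]:
    """Interleave (return-to-go, state, action) per timestep into one sequence."""
    rtg = returns_to_go([s.reward for s in steps])
    tokens: List[Token] = []
    for g, step in zip(rtg, steps):
        tokens.append(("return_to_go", g))
        tokens.append(("state", step.state))
        tokens.append(("action", step.action))
    return tokens


if __name__ == "__main__":
    trajectory = [Step((0.0, 1.0), 1, 1.0), Step((0.5, 0.5), 0, 2.0)]
    for token in flatten_trajectory(trajectory):
        print(token)
```

In a (near-)Markovian task the most recent state token already determines a good action, so the earlier tokens in this sequence add little. That is the concern raised in challenge 1.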

Goal of the workshop: The goal of this workshop is to bring together the decision making community and the foundation models community in vision and language to confront the challenges in decision making at scale. The workshop will span high-level discussions on how foundation models can help decision making (if at all) and low-level algorithmic differences among decision making, vision, and language, which might lead to both opportunities and challenges for applying foundation models to decision making. More specific topics include, but are not limited to:
1. Common or distinct properties of vision, language, and decision making tasks that reassure or challenge the value of foundation models in decision making.
2. Introductions of, or proposals for, new benchmarks to facilitate better research on foundation models for decision making.
3. How decision making can benefit from techniques already popular for foundation models, such as autoregressive sequence models, diffusion models, contrastive pretraining, masked autoencoders, prompting, etc.
4. Lessons learned from developing engineering frameworks, datasets and benchmarks, and evaluation protocols for foundation models in vision and language, and how the decision making community can benefit from these lessons.
5. How foundation models relate to the theoretical foundations of sequential decision making.

Sat 6:50 a.m. - 7:00 a.m. Ofir Nachum: Opening Remarks (In-Person Introduction)
Sat 7:00 a.m. - 7:15 a.m. Is Conditional Generative Modeling all you need for Decision-Making? (Oral Presentation)
Sat 7:15 a.m. - 7:30 a.m. Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (Oral Presentation)
Sat 7:30 a.m. - 7:45 a.m. VIMA: General Robot Manipulation with Multimodal Prompts (Oral Presentation)
Sat 7:45 a.m. - 8:00 a.m. Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action (Oral Presentation)
Sat 8:00 a.m. - 8:30 a.m. Gabriel Barth-Maron: Gato: A Generalist Agent (Invited Talk)
Sat 8:30 a.m. - 9:00 a.m. Jim Fan: Open-Ended Embodied Agents with Internet-Scale Knowledge (Invited Talk)
Sat 9:00 a.m. - 9:30 a.m. Leslie P. Kaelbling: What does an intelligent robot need to know? (Invited Talk)
Sat 9:30 a.m. - 10:00 a.m. Dorsa Sadigh: Learning and Leveraging Foundation Models in Robotics (Invited Talk)
Sat 11:00 a.m. - 11:15 a.m. REACT: Synergizing Reasoning and Acting in Language Models (Oral Presentation)
Sat 11:15 a.m. - 11:30 a.m. Generative Pretraining for Black-Box Optimization (Oral Presentation)
Sat 11:30 a.m. - 11:45 a.m. In-context Reinforcement Learning with Algorithm Distillation (Oral Presentation)
Sat 11:45 a.m. - 12:00 p.m. Large Language Models Are Human-Level Prompt Engineers (Oral Presentation)
Sat 12:00 p.m. - 12:30 p.m. Thomas Wolf: Unlocking Foundation Models for Embodied Learning – What Tools Will We Need? (Invited Talk)
Sat 12:30 p.m. - 1:00 p.m. Machel Reid: On using pre-trained language models for reinforcement learning (Invited Talk)
Sat 1:00 p.m. - 1:30 p.m. Deepak Pathak: Invited Talk (Invited Talk)
Sat 1:30 p.m. - 2:00 p.m. Dale Schuurmans: Large Foundation Models and Reinforcement Learning (Invited Talk)
Sat 2:00 p.m. - 2:30 p.m. Panel Discussion

- Revealing the Bias in Large Language Models via Reward Structured Questions (Poster)
The success of large language models has been amply demonstrated in recent times. Using these models and fine-tuning them for the specific task at hand results in highly performing models. However, these models also learn biased representations from the data they have been trained on. In particular, several studies recently showed that language models can learn to be biased towards certain genders. Quite recently, several studies tried to eliminate this bias by including human feedback in fine-tuning. In our study we show that by changing the question asked to the language model, the log probabilities of the bias measured in the responses change dramatically. Furthermore, in several cases the language model ends up providing a completely opposite response. The recent language models fine-tuned on the prior gender bias datasets do not resolve the actual problem, but rather alleviate the problem for the dataset on which the model is fine-tuned. We believe our results might lay the foundation for further alignment and safety problems in large language models.
Ezgi Korkmaz

- Intelligent Variable Selection for Branch & Bound Methods (Poster)
Combinatorial optimization is applied to a wide variety of real-world problems like job scheduling, capacity planning and supply-chain management. These problems are usually modelled as Mixed Integer Programming (MIP) Problems and solved using the Branch and Bound (B&B) paradigm. The Branch and Bound method partitions the solution space (branching) by creating constrained sub-problems (bounding) and explores those subsets of the solution space which are highly likely to produce optimal solutions. The efficiency of the Branch and Bound method in finding optimal solutions is heavily influenced by the variable and node selection heuristics used for branching.
In this paper, we propose a novel deep reinforcement learning based variable selection strategy. The proposed solution shows significant improvement over Strong Branching (SB) strategies, which have been traditionally used for variable selection. The solution also outperforms the current state of the art RL-based branching strategies like PPO and DQN. The results of our experiments show that the proposed solution is robust and scalable to different kinds of problems.
Priya Shanmugasundaram · Saurabh Jha · Sailendu Patra

- Skill Decision Transformer (Poster)
Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem. However, many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can discover descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark.
Shyam Sudhakaran · Sebastian Risi

- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pretraining (Poster)
Robotics has long been a field riddled with complex systems architectures whose modules and connections, whether traditional or learning-based, require significant human expertise and prior knowledge.
Inspired by large pre-trained language models, this work introduces a paradigm for pre-training a general purpose representation that can serve as a starting point for multiple tasks on a given robot. We present the Perception-Action Causal Transformer (PACT), a generative transformer-based architecture that aims to build representations directly from robot data in a self-supervised fashion. Through autoregressive prediction of states and actions over time, our model implicitly encodes dynamics and behaviors for a particular robot. Our experimental evaluation focuses on the domain of mobile agents, where we show that this robot-specific representation can function as a single starting point to achieve distinct tasks such as safe navigation, localization and mapping. We evaluate two form factors: a wheeled robot that uses a LiDAR sensor as perception input (MuSHR), and a simulated agent that uses first-person RGB images (Habitat). We show that finetuning small task-specific networks on top of the larger pretrained model results in significantly better performance compared to training a single model from scratch for all tasks simultaneously, and comparable performance to training a separate large model for each task independently. By sharing a common good-quality representation across tasks we can lower overall model capacity and speed up the real-time deployment of such systems.
Rogerio Bonatti · Sai Vemprala · shuang ma · Felipe Vieira Frujeri · Shuhang Chen · Ashish Kapoor

- SMART: Self-supervised Multi-task pretrAining with contRol Transformers (Poster)
Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels.
When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework, Self-supervised Multi-task pretrAining with contRol Transformer (SMART). By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves the learning efficiency among seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.
Yanchao Sun · shuang ma · Ratnesh Madaan · Rogerio Bonatti · Furong Huang · Ashish Kapoor

- LATTE: LAnguage Trajectory TransformEr (Poster)
Natural language is one of the most intuitive ways to express human intent. However, translating instructions and commands towards robotic motion generation and deployment in the real world is far from being an easy task.
The challenge of combining a robot's inherent low-level geometric and kinodynamic constraints with a human's high-level semantic instructions traditionally is solved using task-specific solutions with little generalizability between hardware platforms, often with the use of static sets of target actions and commands. This work instead proposes a flexible language-based framework that allows a user to modify generic robotic trajectories. Our method leverages pre-trained language models (BERT and CLIP) to encode the user's intent and target objects directly from a free-form text input and scene images, fuses geometrical features generated by a transformer encoder network, and finally outputs trajectories using a transformer decoder, without the need of priors related to the task or robot information. We significantly extend the previous work presented in Bucker et al. (2022) by expanding the trajectory parametrization space to 3D and velocity as opposed to just XY movements. In addition, we now train the model to use actual images of the objects in the scene for context (as opposed to textual descriptions), and we evaluate the system in a diverse set of scenarios beyond manipulation, such as aerial and legged robots. Our simulated and real-life experiments demonstrate that our transformer model can successfully follow human intent, modifying the shape and speed of trajectories within multiple environments.
A Bucker · Luis Figueredo · Sami Haddadin · Ashish Kapoor · shuang ma · Sai Vemprala · Rogerio Bonatti

- Build generally reusable agent-environment interaction models (Poster)
This paper tackles the problem of how to pre-train a model and make it a generally reusable backbone for downstream task learning.
In pre-training, we propose a method that builds an agent-environment interaction model by learning domain invariant successor features from the agent's vast experiences covering various tasks, then discretizes them into behavior prototypes, which results in an embodied set structure. To make the model generally reusable for downstream task learning, we propose (1) embodied feature projection that retains previous knowledge by projecting the new task's observation-action pair to the embodied set structure and (2) projected Bellman updates which add learning plasticity for the new task setting. We provide preliminary results that show downstream task learning based on a pre-trained embodied set structure can handle unseen changes in task objectives, environmental dynamics and sensor modalities.
Jun Jin · Hongming Zhang · Jun Luo

- Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains (Poster)
Multi-modal foundational models are trained on millions of pairs of natural images and texts, frequently obtained through web-crawling approaches. Although their performance is excellent, these models do not generalize well to other domains, such as medical imaging, especially when these domains do not resemble the centric-like images that can be found on the web. In this study, we assess the ability of the stable diffusion model to generate domain-specific images in the particular case of medical imaging. Based on quantitative and qualitative evaluations of the main components of the stable diffusion pipeline (the variational autoencoder, the U-Net and the text-encoder), we explore several approaches to fine-tune stable diffusion to generate radiological images, which accurately represent the clinical content of conditional text prompts. Our best-performing model improves upon the stable diffusion baseline and can be correctly conditioned to insert an abnormality on a synthetic radiology image.
Pierre Chambon · Christian Bluethgen · Curtis Langlotz · Akshay Chaudhari

- What Makes Certain Pre-Trained Visual Representations Better for Robotic Learning? (Poster)
Deep learning for robotics is data-intensive, but collecting high-quality robotics data at scale is prohibitively expensive. One approach to mitigate this is to leverage visual representations pre-trained on relatively abundant non-robotic datasets. So far, existing works have focused on proposing pre-training strategies and assessing them via ablation studies, giving high-level knowledge of how pre-training design choices affect downstream performance. However, the significant gap in data and objective between the two stages motivates a more detailed understanding of what properties of better pre-trained visual representations enable their comparative advantage. In this work, we empirically analyze the representations of robotic manipulation data from several standard benchmarks under a variety of pre-trained models, correlating key metrics of the representations with closed-loop task performance after behavior cloning. We find evidence that suggests our proposed metrics have substantive predictive power for downstream robotic learning.
Kyle Hsu · Tyler Lum · Ruohan Gao · Shixiang (Shane) Gu · Jiajun Wu · Chelsea Finn

- Large Language Models Still Can't Plan (A Benchmark for LLMs on Planning and Reasoning about Change) (Poster)
Recent advances in large language models (LLMs) have transformed the field of natural language processing (NLP). From GPT-3 to PaLM, the state-of-the-art performance on natural language tasks is being pushed forward with every new large language model. Along with natural language abilities, there has been a significant interest in understanding whether such models exhibit reasoning capabilities with the use of reasoning benchmarks.
However, even though results are seemingly positive, these benchmarks prove to be simplistic in nature and the performance of LLMs on these benchmarks cannot be used as evidence to support, many a times outlandish, claims being made about LLMs' reasoning capabilities. Further, these only represent a very limited set of simple reasoning tasks and we need to look at more sophisticated reasoning problems if we are to measure the true limits of such LLM-based systems. Motivated by this, we propose an extensible assessment framework to test the capabilities of LLMs on reasoning about actions and change, a central aspect of human intelligence. We provide multiple test cases that are more involved than any of the previously established benchmarks and each test case evaluates a different aspect of reasoning about actions and change. Results on GPT-3 (davinci), Instruct-GPT3 (text-davinci-002) and BLOOM (176B), showcase subpar performance on such reasoning tasks.
Karthik Valmeekam · Alberto Olmo · Sarath Sreedharan · Subbarao Kambhampati

- A Control-Centric Benchmark for Video Prediction (Poster)
Video is a promising source of knowledge for embodied agents to learn models of the world's dynamics. Large deep networks have become increasingly effective at modeling complex video data in a self-supervised manner, as evaluated by metrics based on human perceptual similarity or pixel-wise comparison. However, it remains unclear whether current metrics are accurate indicators of performance on downstream tasks. We find empirically that for planning robotic manipulation, existing metrics can be unreliable at predicting execution success. To address this, we propose a benchmark for action-conditioned video prediction in the form of a control benchmark that evaluates a given model for simulated robotic manipulation through sampling-based planning.
Our benchmark, Video Prediction for Visual Planning (VP²), includes simulated environments with 11 task categories and 310 task instance definitions, a full planning implementation, and training datasets containing scripted interaction trajectories for each task category. A central design goal of our benchmark is to expose a simple interface (a single forward prediction call) so it is straightforward to evaluate almost any action-conditioned video prediction model. We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling by analyzing three highly-performant video prediction models, finding that while scale can improve perceptual quality when modelling visually diverse settings, other attributes such as uncertainty awareness can also aid planning performance.
Stephen Tian · Chelsea Finn · Jiajun Wu

- CabiNet: Scaling Neural Collision Detection for Object Rearrangement with Procedural Scene Generation (Poster)
We address the important problem of generalizing robotic rearrangement to clutter without any explicit object models. We first generate over 650K cluttered scenes (orders of magnitude more than prior work) in diverse everyday environments, such as cabinets and shelves. We render synthetic partial point clouds from this data and use it to train our CabiNet model architecture. CabiNet is a collision model that accepts object and scene point clouds and predicts collisions for SE(3) object poses in the scene. Our representation has a fast inference speed of 7 µs/query with nearly 20% higher performance than baseline approaches in challenging environments. We use this collision model in conjunction with a Model Predictive Path Integral (MPPI) planner to generate collision-free trajectories for picking and placing in clutter.
CabiNet also predicts waypoints, computed from the scene’s signed distance field (SDF), that allow the robot to navigate tight spaces during rearrangement. This improves rearrangement performance by nearly 35% compared to baselines. We systematically evaluate our approach, procedurally generate simulated experiments, and demonstrate that our approach directly transfers to the real world, despite training exclusively in simulation. Robot experiments in completely unknown scenes and objects are shown in the supplementary video.
Adithyavairavan Murali · Arsalan Mousavian · Clemens Eppner · Adam Fishman · Dieter Fox

- Planning With Large Language Models Via Corrective Re-Prompting (Poster)
Extracting knowledge from Large Language Models (LLMs) offers a path to designing intelligent, embodied agents that take advantage of the common sense knowledge present in large language datasets. Related works have queried LLMs with a wide range of contextual information, such as goals, sensor observations and scene descriptions, to generate high-level action plans for a specific task. In this work, we propose a prompting-based strategy for extracting executable plans from an LLM that leverages a novel and readily-accessible source of information: precondition errors. Our approach assumes that actions are only afforded execution in certain contexts (i.e., implicit preconditions must be met for an action to execute), and that the embodied agent has the ability to determine if the action is not executable in the current context (e.g., a precondition error is present). When an agent is unable to execute an action in a plan, our approach re-prompts the LLM with precondition error information to extract a useful and executable action to achieve the intended goal in the current context. We evaluate our approach in the VirtualHome simulation environment on 88 different tasks and 7 scenes.
We evaluate different prompt templates and compare to methods that naively re-sample actions from the LLM. We find that our approach using precondition errors improves the executability and semantic correctness of plans, while also reducing the number of corrective re-prompts for querying actions.
Shreyas Sundara Raman · Vanya Cohen · Eric Rosen · Ifrah Idrees · David Paulius · Stefanie Tellex

- Decision Making as Language Generation (Poster)
Decision transformers are a recently proposed approach to offline reinforcement learning that leverages transformer-based auto-regressive sequence models. We discuss challenges associated with fine-tuning a given, pre-trained language model on a decision making task. We propose solutions to these challenges and study their viability on a shortest path problem. We also show how a given language model allows us to bring to bear data-centric approaches to improving the model and how it opens up the possibility to treat the decision transformer objective as one task alongside others to perform transfer learning.
Roland Memisevic · Sunny P Panchal · Mingu Lee

- Multi-step Planning for Automated Hyperparameter Optimization with OptFormer (Poster)
As machine learning permeates more industries and models become more expensive and time consuming to train, the need for efficient automated hyperparameter optimization (HPO) has never been more pressing. Multi-step planning based approaches to hyperparameter optimization promise improved efficiency over myopic alternatives by more effectively balancing out exploration and exploitation. However, the potential of these approaches has not been fully realized due to their technical complexity and computational intensity. In this work, we leverage recent advances in Transformer-based, natural-language-interfaced hyperparameter optimization to circumvent these barriers.
We build on top of the recently proposed OptFormer, which casts both hyperparameter suggestion and target function approximation as autoregressive generation, thus making planning via rollouts simple and efficient. We conduct extensive exploration of different strategies for performing multi-step planning on top of the OptFormer model to highlight its potential for use in constructing non-myopic HPO strategies.
Lucio M Dery · Abram Friesen · Nando de Freitas · Marc'Aurelio Ranzato · Yutian Chen

- A Mixture-of-Expert Approach to RL-based Dialogue Management (Poster)
Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and ability to carry on rich conversations remain a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the word-level, and thus, have to deal with a combinatorially complex action space even for a medium-size vocabulary. As a result, they struggle to produce a successful and engaging dialogue even if they are warm-started with a pre-trained LM. To address this issue, we develop an RL-based DM using a novel mixture of expert language model (MoE-LM) that consists of (i) an LM capable of learning diverse semantics for conversation histories, (ii) a number of specialized LMs (or experts) capable of generating utterances corresponding to a particular attribute or personality, and (iii) an RL-based DM that performs dialogue planning with the utterances generated by the experts. Our MoE approach provides greater flexibility to generate sensible utterances with different intents and allows RL to focus on conversational-level DM.
We compare it with SOTA baselines on open-domain dialogues and demonstrate its effectiveness both in terms of the diversity and sensibility of the generated utterances and the overall DM performance.
Yinlam Chow · Azamat Tulepbergenov · Ofir Nachum · Dhawal Gupta · Moonkyung Ryu · Mohammad Ghavamzadeh · Craig Boutilier

- Foundation Models for Semantic Novelty in Reinforcement Learning (Poster)
Effectively exploring the environment is a key challenge in reinforcement learning (RL). We address this challenge by defining a novel intrinsic reward based on a foundation model, such as contrastive language image pretraining (CLIP), which can encode a wealth of domain-independent semantic visual-language knowledge about the world. Specifically, our intrinsic reward is defined based on pre-trained CLIP embeddings without any fine-tuning or learning on the target RL task. We demonstrate that CLIP-based intrinsic rewards can drive exploration towards semantically meaningful states and outperform state-of-the-art methods in challenging sparse-reward procedurally-generated environments.
Tarun Gupta · Peter Karkus · Tong Che · Danfei Xu · Marco Pavone

- Large Language Models Are Human-Level Prompt Engineers (Poster)
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function.
To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 21/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts.
Yongchao Zhou · Andrei Muresanu · Ziwen Han · Silviu Pitis · Harris Chan · Keiran Paster · Jimmy Ba

- Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks (Poster)
Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties.
Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function. Link » Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare 🔗 - Return Augmentation gives Supervised RL Temporal Compositionality (Poster)  link »    Offline Reinforcement Learning (RL) methods that use supervised learning or sequence modeling (e.g., Decision Transformer) work by training a return-conditioned policy. A fundamental limitation of these approaches, as compared to value-based methods, is that they have trouble generalizing to behaviors that have a higher return than what was seen at training. Value-based offline-RL algorithms like CQL use bootstrapping to combine training data from multiple trajectories to learn strong behaviors from sub-optimal data. We set out to endow RL via Supervised Learning (RvS) methods with this form of temporal compositionality. To do this, we introduce SuperB, a dynamic programming algorithm for data augmentation that augments the returns in the offline dataset by combining rewards from intersecting trajectories. We show theoretically that SuperB can improve sample complexity and enable RvS to find optimal policies in cases where it previously fell behind the performance of value-based methods. 
Empirically, we find that SuperB improves the performance of RvS in several offline RL environments, surpassing the prior state-of-the-art RvS agents in AntMaze by orders of magnitude and offering performance competitive with value-based algorithms on the D4RL-gym tasks. Link » Keiran Paster · Silviu Pitis · Sheila McIlraith · Jimmy Ba 🔗 - Offline Q-learning on Diverse Multi-Task Data Both Scales And Generalizes (Poster) link » The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the lessons from these works, we re-examine previous design choices and find that with appropriate choices (ResNets, cross-entropy based distributional backups, and feature normalization), offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using networks of up to 80 million parameters, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.
Link » Aviral Kumar · Rishabh Agarwal · XINYANG GENG · George Tucker · Sergey Levine 🔗 - Pre-Training for Robots: Leveraging Diverse Multitask Data via Offline Reinforcement Learning (Poster) link » Recent progress in deep learning highlights the tremendous potential of utilizing diverse datasets for achieving effective generalization and makes it enticing to consider leveraging broad datasets for attaining more robust generalization in robotic learning as well. However, in practice we likely will want to learn a new skill in a new environment that is unlikely to be contained in the prior data. Therefore we ask: how can we leverage existing diverse offline datasets in combination with small amounts of task-specific data to solve new tasks, while still enjoying the generalization benefits of training on large amounts of data? In this paper, we demonstrate that end-to-end offline RL can be an effective approach for doing this, without the need for any representation learning or vision-based pre-training. We present pre-training for robots (PTR), a framework based on offline RL that attempts to effectively learn new tasks by combining pre-training on existing robotic datasets with rapid fine-tuning on a new task, with as few as 10 demonstrations. At its core, PTR applies an existing offline RL method such as conservative Q-learning (CQL), but extends it to include several crucial design decisions that enable PTR to actually work and outperform a variety of prior methods. To the best of our knowledge, PTR is the first offline RL method that succeeds at learning new tasks in a new domain on a real WidowX robot with as few as 10 task demonstrations, by effectively leveraging an existing dataset of diverse multi-task robot data collected in a variety of toy kitchens.
We present an accompanying overview video at https://www.youtube.com/watch?v=yAWgyLJD5lY&ab_channel=PTRICLR Link » Aviral Kumar · Anikait Singh · Frederik Ebert · Yanlai Yang · Chelsea Finn · Sergey Levine 🔗 - Offline Reinforcement Learning from Heteroskedastic Data Via Support Constraints (Poster)  link » Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. We show that existing popular offline RL methods based on distribution constraints fail to learn from data with such non-uniform change in the variability of demonstrated behaviors, often due to the requirement to stay close to the behavior policy to the same extent across the state space. We demonstrate this failure mode both theoretically and experimentally. Ideally, the learned policy should be free to choose per-state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning and show that support constraints emerge when doing so. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method CQL (ReDS) is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation. 
Link » Anikait Singh · Aviral Kumar · Quan Vuong · Yevgen Chebotar · Sergey Levine 🔗 - Planning with Large Language Models for Code Generation (Poster) link » Existing large language model-based code generation pipelines typically use beam search or sampling algorithms during the decoding process. Although the programs they generate achieve high token-matching-based scores, they often fail to compile or generate incorrect outputs. The main reason is that conventional Transformer decoding algorithms may not be the best choice for code generation. In this work, we propose a novel Transformer decoding algorithm, Planning-Guided Transformer Decoding (PG-TD), that uses a planning algorithm to do lookahead search and guide the Transformer to generate better programs. Specifically, instead of simply optimizing the likelihood of the generated sequences, the Transformer makes use of a planner that generates complete programs and tests them on public test cases. The Transformer can therefore make more informed decisions and generate tokens that will eventually lead to higher-quality programs. We also design a mechanism that shares information between the Transformer and the planner to make our algorithm computationally efficient. We empirically evaluate our framework with several large language models as backbones on public coding challenge benchmarks, showing that 1) it can generate programs that consistently achieve higher performance compared with competing baseline methods; 2) it enables controllable code generation, such as concise or highly commented code, by optimizing a modified objective. Link » Shun Zhang · Zhenfang Chen · Yikang Shen · Mingyu Ding · Josh Tenenbaum · Chuang Gan 🔗 - Learning Control by Iterative Inversion (Poster) link » We formulate learning for control as an inverse problem: inverting a dynamical system to give the actions which yield desired behavior.
The key challenge in this formulation is a distribution shift in the inputs to the function to be inverted: the learning agent can only observe the forward mapping (its actions' consequences) on trajectories that it can execute, yet must learn the inverse mapping for inputs-outputs that correspond to a different, desired behavior. We propose a general recipe for inverse problems with a distribution shift that we term iterative inversion: learn the inverse mapping under the current input distribution (policy), then use it on the desired output samples to obtain a new input distribution, and repeat. As we show, iterative inversion can converge to the desired inverse mapping, but under rather strict conditions on the mapping itself. We next apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories (without actions), and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. We find that by constantly adding the demonstrated trajectory embeddings as input to the policy when generating trajectories to imitate, à la iterative inversion, we effectively steer the learning towards the desired trajectory distribution. To the best of our knowledge, this is the first exploration of learning control from the viewpoint of inverse problems, and the main advantage of our approach is simplicity: it does not require rewards, and only employs supervised learning, which can be easily scaled to use state-of-the-art trajectory embedding techniques and policy representations. Indeed, with a VQ-VAE embedding and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks. Further, we report improved performance on imitating diverse behaviors compared to reward-based methods.
Link » Gal Leibovich · Guy Jacob · Or Avner · Gal Novik · Aviv Tamar 🔗 - Multi-Environment Pretraining Enables Transfer to Action Limited Datasets (Poster) link » Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with the logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a target environment of interest with fully-annotated datasets from various other source environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, even when using annotated datasets equivalent to only 12 minutes of gameplay. Link » David Venuto · Mengjiao (Sherry) Yang · Pieter Abbeel · Doina Precup · Igor Mordatch · Ofir Nachum 🔗 - Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task (Poster) link » We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost.
With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. Further, we propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness. Link » Jannik Kossen · Cătălina Cangea · Eszter Vértes · Andrew Jaegle · Viorica Patraucean · Ira Ktena · Nenad Tomasev · Danielle Belgrave 🔗 - Foundation Models for History Compression in Reinforcement Learning (Poster) link » Agents interacting under partial observability require access to past observations via a memory mechanism in order to approximate the true state of the environment. Recent work suggests that leveraging language as abstraction provides benefits for creating a representation of past events. History Compression via Language Models (HELM) leverages a pretrained Language Model (LM) for representing the past. It relies on a randomized attention mechanism to translate environment observations to token embeddings. In this work, we show that the representations resulting from this attention mechanism can collapse under certain conditions. This causes blindness of the agent to subtle changes in the environment that may be crucial in solving a certain task. We propose a solution to this problem consisting of two parts.
First, we improve upon HELM by substituting the attention mechanism with a feature-wise centering-and-scaling operation. Second, we take a step toward semantic history compression by leveraging foundation models, such as CLIP, to encode observations, which further improves performance. By combining foundation models, our agent is able to solve the challenging MiniGrid-Memory environment. Surprisingly, however, our experiments suggest that this is not due to the semantic enrichment of the representation presented to the LM, but rather due to the discriminative power provided by CLIP. Link » Fabian Paischer · Thomas Adler · Andreas Radler · Markus Hofmarcher · Sepp Hochreiter 🔗 - Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks (Poster) link » Demonstrations and natural language instructions are two common ways to specify and teach robots novel tasks. However, for many complex tasks, a demonstration or language instruction alone contains ambiguities, preventing tasks from being specified clearly. In such cases, a combination of both a demonstration and an instruction more concisely and effectively conveys the task to the robot than either modality alone. To instantiate this problem setting, we train a single multi-task policy on a few hundred challenging robotic pick-and-place tasks and propose DeL-TaCo (Joint Demo-Language Task Conditioning), a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction. By allowing these two modalities to mutually disambiguate and clarify each other during novel task specification, DeL-TaCo (1) substantially decreases the teacher effort needed to specify a new task and (2) achieves better generalization performance on novel objects and instructions over previous task-conditioning methods.
To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone. Link » Albert Yu · Raymond Mooney 🔗 - How crucial is Transformer in Decision Transformer? (Poster) link » Decision Transformer (DT) is a recently proposed architecture for Reinforcement Learning that frames the decision-making process as an auto-regressive sequence modeling problem and uses a Transformer model to predict the next action in a sequence of states, actions, and rewards. In this paper, we analyze how crucial the Transformer model is in the complete DT architecture. Namely, we replace the Transformer by an LSTM model while keeping the other parts unchanged to obtain what we call a Decision LSTM model. We compare it to the Decision Transformer on continuous control tasks, including pendulum swing-up and stabilization tasks in simulation and on physical hardware. Our experiments show that Decision Transformer struggles with stabilization tasks, such as inverted pendulum and Furuta pendulum stabilization. On the other hand, the proposed Decision LSTM is able to achieve expert-level performance on these tasks, in addition to learning a swing-up controller on the real system. These results indicate that the strength of the Decision Transformer may lie in the overall sequential modeling architecture and not in the Transformer per se. Therefore, a further investigation into the effects of employing other sequence models in place of the Transformer is desirable. Link » Max Siebenborn · Boris Belousov · Junning Huang · Jan Peters 🔗 - Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning (Poster) link » The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. 
In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics. Link » Baiting Zhu · Meihua Dang · Aditya Grover 🔗 - Is Conditional Generative Modeling all you need for Decision-Making? (Poster) link » Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks.
By modeling a policy as a return-conditional generative model, we avoid the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional generative models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making. Link » Anurag Ajay · Yilun Du · Abhi Gupta · Josh Tenenbaum · Tommi Jaakkola · Pulkit Agrawal 🔗 - Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning (Poster) link » Offline reinforcement-learning (RL) algorithms learn to make decisions using a given, fixed training dataset without the possibility of additional online data collection. This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment. However, this promise also bears the drawback of this setting. The restricted dataset induces subjective uncertainty because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. Moreover, inherent system stochasticity further increases uncertainty and aggravates the offline RL problem, preventing the agent from learning an optimal policy. 
To mitigate the destructive uncertainty effects, we need to balance the aspiration to take reward-maximizing actions with the incurred risk due to incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize their returns without unacceptable levels of risk. We integrate MPT into the agent's decision-making process to present a simple-yet-highly-effective risk-aware planning algorithm for offline RL. Our algorithm allows us to systematically account for the estimated quality of specific actions and their estimated risk due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner for offline RL tasks, maximizing the return while significantly reducing the variance. Link » Dan Elbaz · Gal Novik · Oren Salzman 🔗 - In-Context Policy Iteration (Poster) link » This work presents In-Context Policy Iteration, an algorithm for performing Reinforcement Learning (RL), in-context, using foundation models. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the “few-shot” quality that made in-context learning attractive to begin with. In this work, we present an algorithm, ICPI, that learns to perform RL tasks without expert demonstrations or gradients. Instead we present a policy-iteration method in which the prompt content is the entire locus of learning.
ICPI iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our algorithm using Codex (Chen et al., 2021b), a language model with no prior knowledge of the domains on which we evaluate it. Link » Ethan Brooks · Logan Walls · Richard L Lewis · Satinder Singh 🔗 - In-context Reinforcement Learning with Algorithm Distillation (Poster) link » We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data. Link » Michael Laskin · Luyu Wang · Junhyuk Oh · Emilio Parisotto · Stephen Spencer · Richie Steigerwald · DJ Strouse · Steven Hansen · Angelos Filos · Ethan Brooks · Maxime Gazeau · Himanshu Sahni · Satinder Singh · Volodymyr Mnih 🔗 - Contextual Transformer for Offline Meta Reinforcement Learning (Poster) link » Recently, the pretrain-tuning paradigm in large-scale sequence models has made significant progress in Natural Language Processing and Computer Vision.
However, such a paradigm is still hindered by intractable challenges in Reinforcement Learning (RL), including the lack of self-supervised large-scale pretraining methods based on offline data and efficient fine-tuning/prompt-tuning over unseen downstream tasks. In this work, we explore how prompts can help sequence-modeling-based offline Reinforcement Learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional generation. As such, we can pretrain a model on the offline dataset with supervised loss and learn a prompt to guide the policy to play the desired actions. Secondly, we extend the framework to the Meta-RL setting and propose Contextual Meta Transformer (CMT), which leverages the context among different tasks as the prompt to improve the performance on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark; the results validate the strong performance, high computation efficiency, and generality of our methods. Link » Runji Lin · Ye Li · Xidong Feng · Zhaowei Zhang · XIAN HONG WU FUNG · Haifeng Zhang · Jun Wang · Yali Du · Yaodong Yang 🔗 - Generative Pretraining for Black-Box Optimization (Poster) link » Many problems in science and engineering involve optimizing an expensive black-box function over a high-dimensional space. For such black-box optimization (BBO) problems, we typically assume a small budget for online function evaluations, but also often have access to a fixed, offline dataset for pretraining. Prior approaches seek to utilize the offline data to approximate the function or its inverse but are not sufficiently accurate far from the data distribution. We propose BONET, a generative framework for pretraining a novel black-box optimizer using offline datasets. 
In BONET, we train an autoregressive model on fixed-length trajectories derived from an offline dataset. We design a sampling strategy to synthesize trajectories from offline data using a simple heuristic of rolling out monotonic transitions from low-fidelity to high-fidelity samples. Empirically, we instantiate BONET using a causally masked transformer and evaluate it on Design-Bench, where we rank the best on average, outperforming state-of-the-art baselines. Link » Siddarth Krishnamoorthy · Satvik Mashkaria · Aditya Grover 🔗 - Understanding Hindsight Goal Relabeling Requires Rethinking Divergence Minimization (Poster) link » Hindsight goal relabeling has become a foundational technique for multi-goal reinforcement learning (RL). The idea is quite simple: any arbitrary trajectory can be seen as an expert demonstration for reaching the trajectory's end state. Intuitively, this procedure trains a goal-conditioned policy to imitate a sub-optimal expert. However, this connection between imitation and hindsight relabeling is not well understood. Modern imitation learning algorithms are described in the language of divergence minimization, and yet it remains an open problem how to recast hindsight goal relabeling into that framework. In this work, we develop a unified objective for goal-reaching that explains such a connection, from which we can derive goal-conditioned supervised learning (GCSL) and the reward function in hindsight experience replay (HER) from first principles. Experimentally, we find that despite recent advances in goal-conditioned behaviour cloning (BC), multi-goal Q-learning can still outperform BC-like methods; moreover, a vanilla combination of both actually hurts model performance. Under our framework, we study when BC is expected to help, and empirically validate our findings. Our work further bridges goal-reaching and generative modeling, illustrating the nuances and new pathways of extending the success of generative models to RL.
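The relabeling idea in the hindsight goal relabeling abstract above (any trajectory is a demonstration for reaching its own end state) can be illustrated with a minimal sketch. This is not the authors' code: the grid-world states, the function names, and the 0 / -1 sparse reward convention are assumptions chosen for illustration.

```python
from typing import List, Tuple

State = Tuple[int, int]  # hypothetical grid-world state
Action = str             # hypothetical discrete action label


def relabel_with_final_state(
    states: List[State], actions: List[Action]
) -> List[Tuple[State, State, Action]]:
    """Turn one trajectory into (state, goal, action) training examples.

    The goal of every example is the state the trajectory actually ended in,
    so the trajectory counts as a successful demonstration for that goal.
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected one more state than actions")
    goal = states[-1]
    return [(s, goal, a) for s, a in zip(states[:-1], actions)]


def sparse_goal_reward(next_state: State, goal: State) -> float:
    """Assumed sparse reward: 0 when the goal is reached, -1 otherwise."""
    return 0.0 if next_state == goal else -1.0


# A short trajectory that never reached any externally specified goal.
states = [(0, 0), (0, 1), (1, 1)]
actions = ["up", "right"]

examples = relabel_with_final_state(states, actions)
rewards = [sparse_goal_reward(s_next, states[-1]) for s_next in states[1:]]
```

The `examples` list is what a goal-conditioned supervised learner would imitate, while `rewards` is the kind of relabeled signal a goal-conditioned Q-learner would consume.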
Link » Lunjun Zhang · Bradly Stadie 🔗 - REACT: Synergizing Reasoning and Acting in Language Models (Poster) link » While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. 
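The interleaving of reasoning traces and actions described in the ReAct abstract above can be sketched as a simple loop. This is only an illustration under stated assumptions: `scripted_policy` is a hypothetical stand-in for a prompted LLM, `TOY_KB` and `search` stand in for an external knowledge source, and the `Thought` / `Action` / `Observation` transcript format is one plausible layout, not the paper's exact prompt.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A policy maps the transcript so far to (thought, action name, action argument).
Policy = Callable[[List[str]], Tuple[str, str, str]]


def react_episode(
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Alternate thought -> action -> observation until the policy finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, argument = policy(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Action: {action}[{argument}]")
        if action == "finish":
            return argument, transcript
        if action in tools:
            observation = tools[action](argument)
        else:
            observation = f"Unknown action: {action}"
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer


# Hypothetical knowledge source and tool.
TOY_KB = {"France": "France is a country in Europe; its capital is Paris."}


def search(query: str) -> str:
    return TOY_KB.get(query, "No entry found.")


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for an LLM: search once, then answer."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return ("I should look up France.", "search", "France")
    answer = observations[-1].split("capital is ")[-1].rstrip(".")
    return ("The observation states the capital.", "finish", answer)


answer, transcript = react_episode(
    scripted_policy, {"search": search}, "What is the capital of France?"
)
```

In a real setting the policy would be a language model conditioned on the transcript plus a few in-context examples; the loop structure is the part this sketch is meant to show.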
Link » Shunyu Yao · Jeffrey Zhao · Dian Yu · Izhak Shafran · Karthik Narasimhan · Yuan Cao 🔗 - ConserWeightive Behavioral Cloning for Reliable Offline Reinforcement Learning (Poster) link » The goal of offline reinforcement learning (RL) is to learn near-optimal policies from static logged datasets, thus sidestepping expensive online interactions. Behavioral cloning (BC) provides a straightforward solution to offline RL by mimicking offline trajectories via supervised learning. Recent advances (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021) have shown that by conditioning on desired future returns, BC can perform competitively with its value-based counterparts, while enjoying much more simplicity and training stability. However, the distribution of returns in the offline dataset can be arbitrarily skewed and suboptimal, which poses a unique challenge for conditioning BC on expert returns at test time. We propose ConserWeightive Behavioral Cloning, a simple and effective method for improving the performance of conditional BC for offline RL with two key components: trajectory weighting and conservative regularization. Trajectory weighting addresses the bias-variance tradeoff in conditional BC and provides a principled mechanism to learn from both low return trajectories (typically plentiful) and high return trajectories (typically few). Further, we analyze the notion of conservatism in existing BC methods, and propose a novel conservative regularizer that explicitly encourages the policy to stay close to the data distribution. The regularizer helps achieve more reliable performance, and removes the need for ad-hoc tuning of the conditioning value during evaluation.
We instantiate ConserWeightive Behavioral Cloning in the context of Reinforcement Learning via Supervised Learning (RvS) (Emmons et al., 2021) and Decision Transformer (DT) (Chen et al., 2021), and empirically show that it significantly boosts the performance and stability of prior methods on various offline RL benchmarks. Link » Tung Nguyen · Qinqing Zheng · Aditya Grover 🔗 - Skill Acquisition by Instruction Augmentation on Offline Datasets (Poster) link » In recent years, much progress has been made in learning robotic manipulation policies that follow natural language instructions. Commonly, such methods learn from a corpus of robot-language data that was either collected with specific tasks in mind or expensively re-labelled by humans with rich language descriptions in hindsight. Recently, large-scale pretrained vision-language models like CLIP have been applied to robotics in the form of learning representations and planners. Can these pretrained models also be used to cheaply impart internet-scale knowledge onto offline datasets, providing access to skills that were not reflected in ground truth labels? To accomplish this, we introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL): we utilize semi-supervised language labels leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data and then train language-conditioned policies on the augmented datasets. This method enables cheaper acquisition of useful language descriptions compared to expensive human labels, allowing for more efficient label coverage of large-scale datasets. We apply DIAL to a challenging real-world robotic manipulation domain, enabling imitation learning policies to acquire new capabilities and generalize to 80 novel instructions unseen in the original dataset.
Link » Ted Xiao · Harris Chan · Pierre Sermanet · Ayzaan Wahid · Anthony Brohan · Karol Hausman · Sergey Levine · Jonathan Tompson 🔗 - On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning (Poster) link » Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world, and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, and in stark contrast to humans, who rely heavily on world understanding and visual cues for learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. By proper pretraining and concurrent cross-task online fine-tuning, we achieve substantial improvements over a baseline trained from scratch; we improve mean performance of model-based algorithm EfficientZero by 23%, and by as much as 73% in some instances. Link » Yifan Xu · Nicklas Hansen · Zirui Wang · Yung-Chieh Chan · Hao Su · Zhuowen Tu 🔗 - CLaP: Conditional Latent Planners for Offline Reinforcement Learning (Poster) link » Recent work has formulated offline reinforcement learning (RL) as a sequence modeling problem, benefiting from the simplicity and scalability of the Transformer architecture. However, sequence models struggle to model trajectories that are long-horizon or involve complicated environment dynamics.
We propose CLaP (Conditional Latent Planners) to learn a simple goal-conditioned latent space from offline agent behavior, and incrementally decode good actions from a latent plan. We evaluate our method on continuous control domains from the D4RL benchmark. Compared to non-sequential models and return-conditioned sequential models, CLaP shows competitive if not better performance across continuous control tasks. It does particularly well in environments with complex transition dynamics, with up to a +149.8% performance increase. Our results suggest that decision-making is easier with simplified latent dynamics that model behavior as being goal-conditioned. Link » Harry Shin · Rose Wang 🔗 - Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action (Poster) link » Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We develop a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions.
Link » Dhruv Shah 🔗 - Deep Transformer Q-Networks for Partially Observable Reinforcement Learning (Poster) link » Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, susceptible to catastrophic forgetting and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches. Link » Kevin Esslinger · Robert Platt · Christopher Amato 🔗 - Control Graph as Unified IO for Morphology-Task Generalization (Poster) link » The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. In order to align input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce control graph, which treats observations, actions and goals/task in a unified graph representation. 
We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find out that a control graph representation coupled with Transformer architecture improves the multi-task performances compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology task generalization. Link » Hiroki Furuta · Yusuke Iwasawa · Yutaka Matsuo · Shixiang (Shane) Gu 🔗 - Hyper-Decision Transformer for Efficient Online Policy Adaptation (Poster) link » Decision Transformers (DT) have demonstrated strong performances in offline reinforcement learning settings, but quickly adapting to unseen novel tasks remains challenging. To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner. To achieve such a goal, we propose to augment the base DT with an adaptation module, whose parameters are initialized by a hyper-network. When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly. This initialization enables HDT to efficiently adapt to novel tasks by only fine-tuning the adaptation module. We validate HDT's generalization capability on object manipulation tasks. 
We find that with a single expert demonstration and fine-tuning only 0.5% of DT parameters, HDT adapts faster to unseen tasks than fine-tuning the whole DT model. Finally, we explore a more challenging setting where expert actions are not available, and we show that HDT outperforms state-of-the-art baselines in terms of task success rates by a large margin. Demos are available on our project page: https://sites.google.com/view/hdtforiclr2023/home. Link » Mengdi Xu · Yuchen Lu · Yikang Shen · Shun Zhang · Ding Zhao · Chuang Gan 🔗 - Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (Poster) link » Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task.
Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories. Link » Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang 🔗 - VIMA: General Robot Manipulation with Multimodal Prompts (Poster) link » Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9x task success rate given the same training data. With 10x less training data, VIMA still performs 2.7x better than the top competing approach. Video demos are available at https://iclr3081.github.io/. 
Link » Yunfan Jiang · Agrim Gupta · Zichen Zhang · Guanzhi Wang · Yongqiang Dou · Yanjun Chen · Fei-Fei Li · Anima Anandkumar · Yuke Zhu · Linxi Fan 🔗 - Constrained MDPs can be Solved by Early-Termination with Recurrent Models (Poster) link » Safety is one of the crucial concerns for the real-world application of reinforcement learning (RL). Previous works consider the safe exploration problem as a Constrained Markov Decision Process (CMDP), where the policies are optimized under constraints. However, when encountering any potential danger, humans tend to stop immediately and rarely learn to behave safely in danger. Moreover, the off-policy learning nature of humans guarantees high learning efficiency in risky tasks. Motivated by human learning, we introduce a Minimalist Off-Policy Approach (MOPA) to address the Safe-RL problem. We first define the Early Terminated MDP (ET-MDP) as a special type of MDP that has the same optimal value function as its CMDP counterpart. An off-policy learning algorithm MOPA based on recurrent models is then proposed to solve the ET-MDP, which thereby solves the corresponding CMDP. Experiments on various Safe-RL tasks show a substantial improvement over previous methods that directly solve CMDP, in terms of higher asymptotic performance and better learning efficiency. Link » Hao Sun · Ziping Xu · Meng Fang · Zhenghao Peng · Taiyi Wang · Bolei Zhou 🔗 - Supervised Q-Learning can be a Strong Baseline for Continuous Control (Poster) link » Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on exploiting the value function being learned with the first-order update locally, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods based on zeroth-order policy optimization.
This learning paradigm follows Q-learning but overcomes the difficulty of efficiently operating argmax in continuous action space. It finds max-valued action within a small number of samples. The policy learning of ZOSPI has two steps: First, it samples actions and evaluates those actions with a learned value estimator, and then it learns to perform the action with the highest value through supervised learning. We further demonstrate such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on the continuous control benchmarks with a remarkable sample efficiency. Link » Hao Sun · Ziping Xu · Taiyi Wang · Meng Fang · Bolei Zhou 🔗 - Solving PDDL Planning Problems with Pretrained Large Language Models (Poster) link » We study few-shot prompting of pretrained large language models (LLMs) towards solving PDDL planning problems. We are interested in two questions: (1) To what extent can LLMs solve PDDL planning problems on their own? (2) How and to what extent can LLMs be used to guide AI planners? Recent work by Valmeekam et al. (2022) presents negative evidence for (1) in the classic blocks world domain. We confirm this finding, but expand the inquiry to 18 domains and find more mixed results with a few clear successes. For (2), we propose a simple mechanism for using good-but-imperfect LLM outputs to aid a heuristic-search planner. We also find that the LLM performance is due not only to syntactic pattern matching, but also to its commonsense understanding of English terms that appear in the PDDL. Link » Tom Silver · Varun Hariprasad · Reece Shuttleworth · Nishanth Kumar · Tomás Lozano-Pérez · Leslie Kaelbling 🔗 - Collaborating with language models for embodied reasoning (Poster) link » Reasoning in a complex and ambiguous embodied environment is a key goal for Reinforcement Learning (RL) agents. 
While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases, and demonstrate how components of this system can be trained with reinforcement-learning to improve performance. Link » Ishita Dasgupta · Christine Kaeser-Chen · Kenneth Marino · Arun Ahuja · Sheila Babayan · Felix Hill · Rob Fergus 🔗 - Elicitation Inference Optimization for Multi-Principal-Agent Alignment (Poster) link » In multi-principal agent alignment scenarios spanning governance, markets, diplomacy, and AGI, it is infeasible to elicit every principal's view on all perspectives relevant to agent decisions. Elicitation inference optimization (EIO) aims to minimize the $n$ elicitations needed to approximate $N$ principals' views across $K$ perspectives. In this work, we demonstrate an EIO approach where data efficiency ($NK/n$) increases with scale. We introduce STUMP: an elicitation inference model which integrates an LLM with a latent factor model to enable learning transfer across samples, contexts, and languages.
Then, we characterize STUMP's performance on a set of elicitation primitives from which scalable elicitation (sampling) protocols can be constructed. Building from these results, we design and demonstrate two scalable elicitation protocols for STUMP where data efficiency grows boundlessly, scaling like $O(n)$ in the number of elicitations $n$. This makes it possible to obtain complex, high-dimensional preference signals spanning principal populations at any scale. Link » Andrew Konya · Yeping L Qiu · Michael Varga · Aviv Ovadya 🔗 - LMPriors: Pre-Trained Language Models as Task-Specific Priors (Poster) link » Particularly in low-data regimes, an outstanding challenge in machine learning is developing principled techniques for augmenting our models with suitable priors. This is to encourage them to learn in ways that are compatible with our understanding of the world. But in contrast to generic priors such as shrinkage or sparsity, we draw inspiration from the recent successes of large-scale language models (LMs) to construct task-specific priors distilled from the rich knowledge of LMs. Our method, Language Model Priors (LMPriors), incorporates auxiliary natural language metadata about the task (such as variable names and descriptions) to encourage downstream model outputs to be consistent with the LM's common-sense reasoning based on the metadata. Empirically, we demonstrate that LMPriors improve model performance in settings where such natural language descriptions are available, and perform well on several tasks that benefit from such prior knowledge, such as feature selection, causal inference, and safe reinforcement learning. Link » Kristy Choi · Chris Cundy · Sanjari Srivastava · Stefano Ermon 🔗