Workshop
3rd Offline Reinforcement Learning Workshop: Offline RL as a "Launchpad"
Aviral Kumar · Rishabh Agarwal · Aravind Rajeswaran · Wenxuan Zhou · George Tucker · Doina Precup

Fri Dec 02 06:20 AM -- 03:30 PM (PST) @ Room 291 - 292

While offline RL focuses on learning solely from fixed datasets, one of the main learning points from the previous edition of the offline RL workshop was that large-scale RL applications typically want to use offline RL as part of a bigger system rather than as an end goal in itself. Thus, we propose to shift the focus from algorithm design and offline RL applications to how offline RL can be a launchpad, i.e., a tool or a starting point, for solving challenges in sequential decision-making such as exploration, generalization, transfer, safety, and adaptation. In particular, we are interested in studying and discussing methods for learning expressive models, policies, skills and value functions from data that can help us make progress towards efficiently tackling these challenges, which are otherwise often intractable.

Submission site: https://openreview.net/group?id=NeurIPS.cc/2022/Workshop/Offline_RL. The submission deadline is September 25, 2022 (Anywhere on Earth). Please refer to the submission page for more details.

Fri 6:20 a.m. - 6:30 a.m. Opening Remarks
Fri 6:30 a.m. - 7:00 a.m. Offline RL in the context of "Collect and Infer" (Martin Riedmiller) (Invited Talk)
Fri 7:00 a.m. - 7:10 a.m. Efficient Planning in a Compact Latent Action Space (Contributed Talk)
Fri 7:10 a.m. - 7:20 a.m. Control Graph as Unified IO for Morphology-Task Generalization (Contributed Talk)
Fri 7:20 a.m. - 7:30 a.m. Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (Contributed Talk)
Fri 7:35 a.m. - 8:05 a.m. AV2.0: Learning to Drive at a Global Scale (Alex Kendall) (Invited Talk)
Fri 8:05 a.m. - 9:10 a.m. Poster Session 1 (Poster Session)
Fri 9:10 a.m. - 9:40 a.m. Learning from Suboptimal Demonstrations with No Rewards (Dorsa Sadigh) (Invited Talk)
Fri 9:40 a.m. - 10:30 a.m. Break
Fri 10:45 a.m. - 11:30 a.m. Panel Discussion 1 - Applications (Panel Discussion): Kee-Eung Kim (remote), Vijay Badrinarayanan (remote), Taylor Killian (in-person), Tony Jebara (in-person)
Fri 11:30 a.m. - 11:40 a.m. Choreographer: Learning and Adapting Skills in Imagination (Contributed Talk)
Fri 11:40 a.m. - 11:50 a.m. Provable Benefits of Representational Transfer in Reinforcement Learning (Contributed Talk)
Fri 11:50 a.m. - 12:00 p.m. Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning (Contributed Talk)
Fri 12:00 p.m. - 1:00 p.m. Poster Session 2 (Poster Session)
Fri 1:00 p.m. - 1:30 p.m. Reinforcement Learning and LTV at Spotify (Tony Jebara) (Invited Talk)
Fri 1:30 p.m. - 2:00 p.m. Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient (Wen Sun) (Invited Talk)
Fri 2:00 p.m. - 3:00 p.m. Panel Discussion 2 - Research (Panel Discussion): Martha White (remote), Chelsea Finn (in-person), Wen Sun (in-person), Vincent Vanhoucke (remote)
Fri 3:00 p.m. - 3:30 p.m.
Identification of Dead-ends in Safety-Critical Offline RL (Taylor Killian) (Invited Talk)

Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information (Poster)
Learning to control an agent from data collected offline in a rich pixel-based visual observation space is vital for real-world applications of reinforcement learning (RL). A major challenge in this setting is the presence of input information that is hard to model and irrelevant to controlling the agent. This problem has been approached by the theoretical RL community through the lens of exogenous information, i.e., any control-irrelevant information contained in observations. For example, a robot navigating in busy streets needs to ignore irrelevant information, such as other people walking in the background, textures of objects, or birds in the sky. In this paper, we focus on the setting with visually detailed exogenous information, and introduce new offline RL benchmarks offering the ability to study this problem. We find that contemporary representation learning techniques can fail on datasets where the noise is a complex and time-dependent process, which is prevalent in practical applications. To address this, we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO). Despite being simple and requiring no reward, we show theoretically and empirically that the representation created by this objective greatly outperforms baselines.
Riashat Islam · Manan Tomar · Alex Lamb · Hongyu Zang · Yonathan Efroni · Dipendra Misra · Aniket Didolkar · Xin Li · Harm Van Seijen · Remi Tachet des Combes · John Langford

Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks (Poster)
Auxiliary tasks improve the representations learned by deep reinforcement learning agents.
Analytically, their effect is reasonably well-understood; in practice, however, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning; accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.
Jesse Farebrother · Joshua Greaves · Rishabh Agarwal · Charline Le Lan · Ross Goroshin · Pablo Samuel Castro · Marc Bellemare

Confidence-Conditioned Value Functions for Offline Reinforcement Learning (Poster)
Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy.
The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This can be alleviated if we are instead able to learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
Joey Hong · Aviral Kumar · Sergey Levine

Efficient Deep Reinforcement Learning Requires Regulating Statistical Overfitting (Poster)
Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained unclear.
Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform thorough empirical analysis on state-based DeepMind control suite (DMC) tasks in a controlled and systematic way to show that statistical overfitting on the temporal-difference (TD) error is the main culprit that severely affects the performance of deep RL algorithms, and that prior methods that lead to good performance do, in fact, control the amount of statistical overfitting. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on a notion of validation temporal-difference error by utilizing any form of regularization technique from supervised learning. We show that a simple online model selection method that targets the statistical overfitting issue is effective across state-based DMC and Gym tasks.
Qiyang Li · Aviral Kumar · Ilya Kostrikov · Sergey Levine

Domain Generalization for Robust Model-Based Offline RL (Poster)
Existing offline reinforcement learning (RL) algorithms typically assume that training data is either: 1) generated by a known policy, or 2) of entirely unknown origin. We consider multi-demonstrator offline RL, a middle ground where we know which demonstrators generated each dataset, but make no assumptions about the underlying policies of the demonstrators. This is the most natural setting when collecting data from multiple human operators, yet remains unexplored. Since different demonstrators induce different data distributions, we show that this can be naturally framed as a domain generalization problem, with each demonstrator corresponding to a different domain.
Specifically, we propose Domain-Invariant Model-based Offline RL (DIMORL), where we apply Risk Extrapolation (REx) (Krueger et al., 2020) to the process of learning dynamics and rewards models. Our results show that models trained with REx exhibit improved domain generalization performance when compared with the natural baseline of pooling all demonstrators' data. We observe that the resulting models frequently enable the learning of superior policies in the offline model-based RL setting, can improve the stability of the policy learning process, and potentially increase exploration.
Alan Clark · Shoaib Siddiqui · Robert Kirk · Usman Anwar · Stephen Chung · David Krueger

Squeezing more value out of your historical data: data-augmented behavioural cloning as launchpad for reinforcement learning (Poster)
In many real-world applications collecting large, high-quality datasets may be too costly or impractical. Offline reinforcement learning (RL) aims to infer an optimal decision-making policy from a fixed set of data. Getting the most information from this dataset is then vital for good performance. We propose a model-based data augmentation strategy, Trajectory Stitching (TS), to improve the quality of sub-optimal trajectories. TS introduces unseen actions joining previously disconnected states: using a probabilistic notion of state reachability, it effectively 'stitches' together parts of the historical demonstrations to generate new, higher quality ones. A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action. New actions are introduced only when they are expected to be beneficial, according to an estimated state-value function. We show that using supervised learning, behavioural cloning (BC), to extract a decision-making policy from the new TS dataset leads to improvements over the behaviour-cloned policy from the original dataset.
Improving over the BC policy could then be used as a launchpad for online RL through planning and demonstration-guided RL.
Charles Hepburn · Giovanni Montana

Keep Calm and Carry Offline: Policy refinement in offline reinforcement learning (Poster)
The ability to discover optimal behaviour from fixed data sets has the potential to transfer the successes of reinforcement learning (RL) to domains where data collection is acutely problematic. In this offline setting a key challenge is overcoming overestimation bias for actions not present in data which, without the ability to correct for it via interaction with the environment, can propagate and compound during training, thus leading to highly sub-optimal policies. One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC), which encourages agents to pick actions closer to the source data. By finding the right balance between RL and BC such approaches have been shown to be surprisingly effective while requiring minimal changes to the underlying algorithms they are based on. To date, this balance has been held constant but in this work we explore the idea of tipping this balance towards RL following initial training. Using TD3-BC we demonstrate that by continuing to train a policy offline while reducing the influence of the BC component we can produce refined policies that outperform the original baseline, as well as match or exceed the performance of more complex alternative approaches. Furthermore, we show these refined policies can be fine-tuned online while largely mitigating severe performance drops.
Alex Beeson · Giovanni Montana

Guiding Offline Reinforcement Learning Using a Safety Expert (Poster)
Offline reinforcement learning is used to train policies in situations where it is expensive or infeasible to access the environment during training.
An agent trained under such a scenario does not get corrective feedback once the learned policy starts diverging and may fall prey to the overestimation bias commonly seen in this setting. This increases the chances of the agent choosing unsafe/risky actions, especially in states with sparse to no representation in the training dataset. In this paper, we propose to leverage a safety expert to discourage the offline RL agent from choosing unsafe actions in under-represented states in the dataset. The proposed framework transfers the safety expert's knowledge in an offline setting for states with high uncertainty to prevent catastrophic failures from occurring in safety-critical domains. We use a simple but effective approach to quantify the uncertainty of states based on how frequently they appear in a training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert while maximizing the long-term reward. We modify TD3+BC, an existing offline RL algorithm, as a part of the proposed approach. We demonstrate empirically that our approach performs better than TD3+BC on some control tasks and comparably on others across two sets of benchmark datasets while reducing the chance of taking unsafe actions in sparse regions of the state space.
Richa Verma · Kartik Bharadwaj · Harshad Khadilkar · Balaraman Ravindran

Pareto-Efficient Decision Agents for Offline Multi-Objective Reinforcement Learning (Poster)
The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time.
In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.
Baiting Zhu · Meihua Dang · Aditya Grover

Revisiting Bellman Errors for Offline Model Selection (Poster)
Applying offline reinforcement learning in real-world settings necessitates the ability to tune hyperparameters offline, a task known as offline model selection. It is well-known that the empirical Bellman errors are poor predictors of value function estimation accuracy and policy performance. This has led researchers to abandon model selection procedures based on Bellman errors and instead focus on evaluating the expected return under policies of interest. The problem with this approach is that it can be very difficult to use an offline dataset generated by one policy to estimate the expected returns of a different policy. In contrast, we argue that Bellman errors can be useful for offline model selection, and that the discouraging results in past literature have been due to estimating and utilizing them incorrectly.
We propose a new algorithm, Supervised Bellman Validation, that estimates the expected squared Bellman error better than the empirical Bellman errors. We demonstrate the relative merits of our method over competing methods through both theoretical results and empirical results on datasets from the Atari benchmark. We hope that our results will challenge current attitudes and spur future research into Bellman errors and their utility in offline model selection.
Joshua Zitovsky · Rishabh Agarwal · Daniel de Marchi · Michael Kosorok

Boosting Offline Reinforcement Learning via Data Resampling (Poster)
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly constrain the learned policy to be close to the behavior policy. The constraint applies not only to well-performing actions but also to inferior ones, which limits the upper bound of the learned policy. Instead of aligning the densities of two distributions, aligning the supports gives a relaxed constraint while still being able to avoid out-of-distribution actions. Therefore, we propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. More specifically, we construct a better behavior policy by resampling each transition in an old dataset according to its episodic return. We dub our method Return-based Data Rebalance, which can be implemented with less than 10 lines of code change and adds negligible running time. Extensive experiments demonstrate that the method is effective at boosting offline RL performance and orthogonal to decoupling strategies in long-tailed classification. New state-of-the-art results are achieved on the D4RL benchmark.
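The idea of resampling transitions according to their episodic return can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weighting scheme, so the min-max normalization, the `eps` offset, and the function names `return_based_weights` and `resample_indices` are hypothetical choices made for this example. The offset keeps every transition's sampling probability positive, so the support of the dataset is unchanged.

```python
import random


def return_based_weights(episode_returns, episode_lengths, eps=0.1):
    """Give each transition a weight based on its episode's return.

    Assumption: returns are min-max normalized and shifted by eps so that
    even the lowest-return episode keeps a nonzero sampling probability.
    """
    lo, hi = min(episode_returns), max(episode_returns)
    span = hi - lo
    weights = []
    for ret, length in zip(episode_returns, episode_lengths):
        # If all episodes have the same return, fall back to uniform weights.
        w = (ret - lo) / span + eps if span > 0 else 1.0
        # Every transition in the episode shares the episode-level weight.
        weights.extend([w] * length)
    return weights


def resample_indices(weights, num_samples, seed=0):
    """Draw transition indices with replacement, proportionally to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)
```

The indices returned by `resample_indices` would then select the transitions fed to an otherwise unmodified offline RL algorithm.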
Yang Yue · Bingyi Kang · Xiao Ma · Zhongwen Xu · Gao Huang · Shuicheng Yan

General policy mapping: online continual reinforcement learning inspired on the insect brain (Poster)
We have developed a model for online continual reinforcement learning (RL) inspired by the insect brain. Our model leverages the offline training of a feature extraction layer and a common general policy layer to enable the convergence of RL algorithms in online settings. Sharing a common policy layer across tasks leads to positive backward transfer, where the agent continuously improves in older tasks sharing the same underlying general policy. Biologically inspired restrictions to the agent's network are key for the convergence of RL algorithms. This provides a pathway towards efficient online RL in resource-constrained scenarios.
Angel Yanguas-Gil · Sandeep Madireddy

Offline Reinforcement Learning with Closed-Form Policy Improvement Operators (Poster)
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement (CFPI) operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to our CFPI operators.
We instantiate an offline RL algorithm with our novel policy improvement operator and empirically demonstrate its effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
Jiachen Li · Edwin Zhang · Ming Yin · Qinxun Bai · Yu-Xiang Wang · William Yang Wang

On- and Offline Multi-agent Reinforcement Learning for Disease Mitigation using Human Mobility Data (Poster)
The COVID-19 pandemic generates new real-world data-driven problems such as predicting case surges, managing resource depletion, or modeling geo-spatial infection spreading. Though reinforcement learning (RL) has been previously proposed to optimize regional lock-downs, the availability of mobility tracking data with offline RL allows us to push decision making from the top-down perspective (i.e., driven by governments) to the bottom-up perspective (i.e., driven by individuals). Rather than predicting the outcome of the outbreak, we utilize offline RL as a tool, along with epidemic modeling, to empower collaborative decision-making at the individual level. In our investigations, we ask whether we can train the population of a city to become more resilient against infectious diseases. To investigate, we deploy a 'city' of 10,000 agents loaded with real visits at Points of Interest (POIs) (e.g., restaurants, gyms, parks) throughout a target metropolitan area during the COVID-19 pandemic (July 2020). Using a standard disease compartmental model, we find that the city of trained agents can reduce disease transmissions by 60%. This opens a new direction in using offline RL as a springboard to further the research at the intersection of artificial intelligence and disease mitigation.
Sofia Hurtado · Radu Marculescu

Contrastive Example-Based Control (Poster)
While there are many real-world problems that might benefit from reinforcement learning, these problems rarely fit into the MDP mold: interacting with the environment is often prohibitively expensive and specifying reward functions is challenging. Motivated by these challenges, prior work has developed data-driven approaches that learn entirely from samples from the transition dynamics and examples of high-return states. These methods typically learn a reward function from the high-return states, use that reward function to label the transitions, and then apply an offline RL algorithm to these transitions. While these methods can achieve good results on many tasks, they can be complex, carefully regularizing the reward function and using temporal difference updates. In this paper, we propose a simple and scalable approach to offline example-based control. Unlike prior approaches (e.g., ORIL, VICE, PURL) that learn a reward function, our method will learn an implicit model of multi-step transitions. We show that this implicit model can represent the Q-values for the example-based control problem. Thus, whereas a learned reward function must be combined with an RL algorithm to determine good actions, our model can directly be used to determine these good actions. Across a range of state-based and image-based offline control tasks, we find that our method outperforms baselines that use learned reward functions.
Kyle Hatch · Sarthak J Shetty · Benjamin Eysenbach · Tianhe Yu · Rafael Rafailov · Russ Salakhutdinov · Sergey Levine · Chelsea Finn

Offline Policy Evaluation for Reinforcement Learning with Adaptively Collected Data (Poster)
Offline RL is an important step towards making data-hungry RL algorithms more widely usable in the real world, but conventional assumptions on the distribution of logging data do not apply in some key real-world scenarios.
In particular, it is unrealistic to assume that RL practitioners will have access to sets of trajectories that simultaneously are mutually independent and explore well. We propose two natural ways to relax these assumptions: by allowing the data to be distributed according to different logging policies independently, and by allowing logging policies to depend on past trajectories. We discuss Offline Policy Evaluation (OPE) in these settings, analyzing the performance of a model-based OPE estimator when the MDP is tabular.
Sunil Madhow · Dan Qiao · Yu-Xiang Wang

Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies (Poster)
Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there exists no single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement.
We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
Shivakanth Sujit · Pedro Braga · Jörg Bornschein · Samira Ebrahimi Kahou

Offline Policy Comparison with Confidence: Benchmarks and Baselines (Poster)
Decision makers often wish to use offline historical data to compare sequential-action policies at various world states. Importantly, computational tools should produce confidence values for such offline policy comparison (OPC) to account for statistical variance and limited data coverage. Nevertheless, there is little work that directly evaluates the quality of confidence values for OPC. In this work, we address this issue by creating benchmarks for OPC with Confidence (OPCC), derived by adding sets of policy comparison queries to datasets from offline reinforcement learning. In addition, we present an empirical evaluation of the risk versus coverage trade-off for a class of model-based baselines. In particular, the baselines learn ensembles of dynamics models, which are used in various ways to produce simulations for answering queries with confidence values. While our results suggest advantages for certain baseline variations, there appears to be significant room for improvement in future work.
Anurag Koul · Mariano Phielipp · Alan Fern

Residual Model-Based Reinforcement Learning for Physical Dynamics (Poster)
Dynamic control problems are a prevalent topic in robotics. Deep neural networks have been shown to accurately learn many complex dynamics, but these approaches remain data-inefficient or intractable in some tasks. Rather than learning to reproduce the environment dynamics, traditional control approaches use some physical knowledge to describe the environment's evolution. These approaches do not need many samples to be tuned but suffer from approximations and are not adapted to strong modifications of the environment.
In this paper, we introduce a method to learn the parameters of a physical model, i.e., the parameters of an Ordinary Differential Equation (ODE), so as to best fit the observed transitions. This model is complemented by a residual data-driven term that reduces the reality gap between simple physical priors and complex environments. We also show that this approach can be naturally extended to the case of the fine-tuning of an implicit physical model trained on simple simulations.
Zakariae EL ASRI · Clément Rambour · Vincent LE GUEN · Nicolas THOME

Raisin: Residual Algorithms for Versatile Offline Reinforcement Learning (Poster)
The residual gradient algorithm (RG), gradient descent of the Mean Squared Bellman Error, brings robust convergence guarantees to bootstrapped value estimation. Meanwhile, the far more common semi-gradient algorithm (SG) suffers from well-known instabilities and divergence. Unfortunately, RG often converges slowly in practice. Baird (1995) proposed residual algorithms (RA), weighted averaging of RG and SG, to combine RG's robust convergence and SG's speed. RA works moderately well in the online setting. We find, however, that RA works disproportionately well in the offline setting. Concretely, we find that merely adding a variable residual component to SAC increases its score on D4RL gym tasks by a median factor of 54. We further show that using the minimum of ten critics lets our algorithm match SAC-N's state-of-the-art returns using 50× less compute and no additional hyperparameters. In contrast, TD3+BC with the same minimum-of-ten-critics trick does not match SAC-N's returns on a handful of environments.
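The weighted averaging of RG and SG mentioned in the Raisin abstract can be illustrated for a linear value function. This is a minimal sketch, not the Raisin method itself (which adds a residual component to SAC): the function name `residual_update`, the mixing weight `eta`, and the linear parameterization are assumptions made for the example. With TD error `delta`, the semi-gradient step moves along `phi`, the residual-gradient step moves along `phi - gamma * phi_next`, and their weighted average moves along `phi - eta * gamma * phi_next`.

```python
def residual_update(w, phi, phi_next, reward, gamma=0.99, eta=0.5, lr=0.1):
    """One residual-algorithm step for a linear value function V(s) = w . phi(s).

    eta = 0 recovers the semi-gradient (SG) update and eta = 1 the residual
    gradient (RG) update; values in between average the two directions.
    """
    v = sum(wi * p for wi, p in zip(w, phi))
    v_next = sum(wi * p for wi, p in zip(w, phi_next))
    # Temporal-difference error of the current transition.
    delta = reward + gamma * v_next - v
    # (1 - eta) * SG direction + eta * RG direction, simplified algebraically.
    return [
        wi + lr * delta * (p - eta * gamma * pn)
        for wi, p, pn in zip(w, phi, phi_next)
    ]
```

Setting `eta` between 0 and 1 trades the speed of the semi-gradient update against the convergence behavior of the residual gradient update, which is the trade-off the abstract attributes to Baird (1995).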
Braham Snyder · Yuke Zhu

Collaborative symmetricity exploitation for offline learning of hardware design solver (Poster)
This paper proposes the collaborative symmetricity exploitation framework to train a solver for the decoupling capacitor placement problem (DPP), one of the significant hardware design problems. Due to the sequentially coupled multi-level property of the hardware design process, the design condition of DPP changes depending on the design of higher-level problems. Also, the online evaluation of real-world electrical performance through simulation is extremely costly. Thus, we propose a framework that allows data-efficient offline learning of a DPP solver (i.e., contextualized policy) with high generalization capability over changing task conditions. Leveraging the symmetricity for offline learning of hardware design solver increases data-efficiency by reducing the solution space and improves generalization capability by capturing the invariant nature present regardless of changing conditions. Extensive experiments verified that the proposed framework with zero-shot inference outperforms the neural baselines and iterative conventional design methods on the DPP benchmark. Furthermore, the proposed framework greatly outperformed the expert method used to generate the offline data for training.
Haeyeon Kim · Minsu Kim · Joungho Kim · Jinkyoo Park

SPRINT: Scalable Semantic Policy Pre-training via Language Instruction Relabeling (Poster)
We propose SPRINT, an approach for scalable offline policy pre-training based on natural language instructions. SPRINT pre-trains an agent’s policy to execute a diverse set of semantically meaningful skills that it can leverage to learn new tasks faster. Prior work on offline pre-training required tedious manual definition of pre-training tasks or learned semantically meaningless skills via random goal-reaching.
Instead, our approach SPRINT (Scalable Pre-training via Relabeling Language INsTructions) leverages natural language instruction labels on offline agent experience, collected at scale (e.g., via crowd-sourcing), to define a rich set of tasks with minimal human effort. Furthermore, by using natural language to define tasks, SPRINT can use pre-trained large language models to automatically expand the initial task set. By relabeling and aggregating task instructions, even across multiple training trajectories, we can learn a large set of new skills during pre-training. In experiments using a realistic household simulator, we show that agents pre-trained with SPRINT learn new long-horizon household tasks substantially faster than with previous pre-training approaches. Link » Jesse Zhang · Karl Pertsch · Jiahui Zhang · Taewook Nam · Sung Ju Hwang · Xiang Ren · Joseph Lim 🔗 - Bayesian Q-learning With Imperfect Expert Demonstrations (Poster)  link »    Guided exploration with expert demonstrations improves data efficiency for reinforcement learning, but current algorithms often overuse expert information. We propose a novel algorithm to speed up Q-learning with the help of a limited amount of imperfect expert demonstrations. The algorithm avoids excessive reliance on expert data by relaxing the optimal expert assumption and gradually reducing the usage of uninformative expert data. Experimentally, we evaluate our approach on a sparse-reward chain environment and six more complicated Atari games with delayed rewards. We can achieve better results with the proposed methods than Deep Q-learning from Demonstrations (Hester et al., 2017) in most environments. Link » Fengdi Che · Xiru Zhu · Doina Precup · David Meger · Gregory Dudek 🔗 - Can Active Sampling Reduce Causal Confusion in Offline Reinforcement Learning? (Poster)  link » Causal confusion is a phenomenon where an agent learns a policy that reflects imperfect spurious correlations in the data. 
The resulting causally confused behaviors may appear desirable during training but may fail at deployment. This problem gets exacerbated in domains such as robotics with potentially large gaps between open- and closed-loop performance of an agent. In such cases, a causally confused model may appear to perform well according to open-loop metrics but fail catastrophically when deployed in the real world. In this paper, we conduct the first study of causal confusion in offline reinforcement learning and hypothesize that selectively sampling data points that may help disambiguate the underlying causal mechanism of the environment may alleviate causal confusion. To investigate this hypothesis, we consider a set of simulated setups to study causal confusion and the ability of active sampling schemes to reduce its effects. We provide empirical evidence that random and active sampling schemes are able to consistently reduce causal confusion as training progresses and that active sampling is able to do so more efficiently than random sampling. Link » Gunshi Gupta · Tim G. J. Rudner · Rowan McAllister · Adrien Gaidon · Yarin Gal 🔗 - Trajectory-based Explainability Framework for Offline RL (Poster)  link » Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. 
Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as grid-worlds, video games (Atari) and continuous control (MuJoCo). Link » Shripad Deshmukh · Arpan Dasgupta · Chirag Agarwal · Nan Jiang · Balaji Krishnamurthy · Georgios Theocharous · Jayakumar Subramanian 🔗 - AMORE: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data (Poster)  link » We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (AMORE), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, AMORE is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of AMORE never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data. Such a robust policy improvement property makes AMORE especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring. Link » Tengyang Xie · Mohak Bhardwaj · Nan Jiang · Ching-An Cheng 🔗 - Balanced Off-Policy Evaluation for Personalized Pricing (Poster)  link » We consider a feature-based pricing problem, where we have data consisting of feature information, historical pricing decisions, and binary realized demand. We wish to evaluate a new personalized pricing policy that maps features to prices. This problem is known as off-policy evaluation and there is extensive literature on estimating the expected performance of the new policy. 
However, existing methods perform poorly when the logging policy has little exploration, which is common in pricing. We propose a novel method that exploits the special structure of pricing problems and incorporates downstream optimization problems when evaluating the new policy. We establish theoretical convergence guarantees, and we empirically demonstrate the advantage of our method using a real world pricing dataset. Link » Adam N. Elmachtoub · Vishal Gupta · YUNFAN ZHAO 🔗 - ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning (Poster)  link » Given a dataset of interactions with an environment of interest, a viable method to extract an agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises due to the maximum likelihood objective function; namely that BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented with a Gaussian. To address this issue, we develop a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms BC by being mode-seeking in nature. Link » Eddy Hudson · Ishan Durugkar · Garrett Warnell · Peter Stone 🔗 - Dynamics-Augmented Decision Transformer for Offline Dynamics Generalization (Poster)  link » Recent progress in offline reinforcement learning (RL) has shown that it is often possible to train strong agents without potentially unsafe or impractical online interaction. However, in real-world settings, agents may encounter unseen environments with different dynamics, and generalization ability is required. 
This work presents Dynamics-Augmented Decision Transformer (DADT), a simple yet efficient method to train generalizable agents from offline datasets; on top of return-conditioned policy using the transformer architecture, we improve generalization capabilities by using representation learning based on next state prediction. Our experimental results demonstrate that DADT outperforms prior state-of-the-art methods for offline dynamics generalization. Intriguingly, DADT without fine-tuning even outperforms fine-tuned baselines. Link » Changyeon Kim · Junsu Kim · Younggyo Seo · Kimin Lee · Honglak Lee · Jinwoo Shin 🔗 - Offline Reinforcement Learning on Real Robot with Realistic Data Sources (Poster)  link »    Offline Reinforcement Learning (ORL) provides a framework to train control policies from fixed sub-optimal datasets, making it suitable for safety-critical applications like robotics. Despite significant algorithmic advances and benchmarking in simulation, the evaluation of ORL algorithms on real-world robot learning tasks has been limited. Since real robots are sensitive to details like sensor noises, reset conditions, demonstration sources, and test time distribution, it remains a question whether ORL is a competitive solution to real robotic challenges and what would characterize such tasks. We aim to address this deficiency through an empirical study of representative ORL algorithms on four table-top manipulation tasks using a Franka-Panda robot arm. Our evaluation finds that for scenarios with sufficient in-domain data of high quality, specialized ORL algorithms can be competitive with the behavior cloning approach. However, for scenarios that require out-of-distribution generalization or task transfer, ORL algorithms can learn and generalize from offline heterogeneous datasets and outperform behavior cloning. 
Project URL: https://sites.google.com/view/real-orl-anon Link » Gaoyue Zhou · Liyiming Ke · Siddhartha Srinivasa · Abhinav Gupta · Aravind Rajeswaran · Vikash Kumar 🔗 - Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows (Poster)  link »    Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism - i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model - controller in the latent space - is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets. Link » Dmitry Akimov · Alexander Nikulin · Vladislav Kurenkov · Denis Tarasov · Sergey Kolesnikov 🔗 - Matrix Estimation for Offline Evaluation in Reinforcement Learning with Low-Rank Structure (Poster)  link » We consider offline Reinforcement Learning (RL), where the agent does not interact with the environment and must rely on offline data collected using a behavior policy. 
Previous works provide policy evaluation guarantees when the target policy to be evaluated is covered by the behavior policy, that is, state-action pairs visited by the target policy must also be visited by the behavior policy. We show that when the MDP has a latent low-rank structure, this coverage condition can be relaxed. Building on the connection to weighted matrix completion with non-uniform observations, we propose an offline policy evaluation algorithm that leverages the low-rank structure to estimate the values of uncovered state-action pairs. Our algorithm does not require a known feature representation, and our finite-sample error bound involves a novel discrepancy measure quantifying the discrepancy between the behavior and target policies in the spectral space. We provide concrete examples where our algorithm achieves accurate estimation while existing coverage conditions are not satisfied. Link » Xumei Xi · Christina Yu · Yudong Chen 🔗 - Train Offline, Test Online: A Real Robot Learning Benchmark (Poster)  link »    Three challenges limit the progress of robot learning research: robots are expensive (few labs can participate), everyone uses different robots (findings do not generalize across labs), and we lack internet-scale robotics data. We take on these challenges via a new benchmark: Train Offline, Test Online (TOTO). TOTO provides remote users with access to shared robots for evaluating methods on common tasks and an open-source dataset of these tasks for offline training. Its manipulation task suite requires challenging generalization to unseen objects, positions, and lighting. We present initial results on TOTO comparing five pretrained visual representations and four offline policy learning baselines, remotely contributed by five institutions. 
The real promise of TOTO, however, lies in the future: we release the benchmark for additional submissions from any user, enabling easy, direct comparison to several methods without the need to obtain hardware or collect data. Link » Gaoyue Zhou · Victoria Dean · Mohan Kumar Srirama · Aravind Rajeswaran · Jyothish Pari · Kyle Hatch · Aryan Jain · Tianhe Yu · Pieter Abbeel · Lerrel Pinto · Chelsea Finn · Abhinav Gupta 🔗 - Hybrid RL: Using both offline and online data can make RL efficient (Poster)  link »    We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma’s Revenge. Link » Yuda Song · Yifei Zhou · Ayush Sekhari · J. Bagnell · Akshay Krishnamurthy · Wen Sun 🔗 - Choreographer: Learning and Adapting Skills in Imagination (Poster)  link »    We present Choreographer, a model-based agent that exploits its world model to learn and adapt skills in imagination. 
Choreographer is able to learn skills from offline unlabeled data and leverage them for effectively adapting to downstream tasks and for exploring the environment thoroughly, to find sparse rewards. Our method decouples the exploration and skill learning processes, being able to discover skills in the latent state space of the model. For adapting to downstream tasks, the agent uses a meta-controller to evaluate and adapt the learned skills efficiently by deploying them in parallel in imagination. Project website: https://doubleblind-repos.github.io/ Link » Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt · Alexandre Lacoste · Sai Rajeswar Mudumba 🔗 - CORL: Research-oriented Deep Offline Reinforcement Learning Library (Poster)  link »    CORL is an open-source library that provides single-file implementations of Deep Offline Reinforcement Learning algorithms. It emphasizes a simple developing experience with a straightforward codebase and a modern analysis tracking tool. In CORL, we isolate method implementations into distinct single files, making performance-relevant details easier to recognise. Additionally, an experiment tracking feature is available to help log metrics, hyperparameters, dependencies, and more to the cloud. Finally, we have ensured the reliability of the implementations by benchmarking them on the commonly employed D4RL benchmark. Link » Denis Tarasov · Alexander Nikulin · Dmitry Akimov · Vladislav Kurenkov · Sergey Kolesnikov 🔗 - Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (Poster)  link »    Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. 
While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by 2.5x on average. Link » Alexander Nikulin · Vladislav Kurenkov · Denis Tarasov · Dmitry Akimov · Sergey Kolesnikov 🔗 - Offline Reinforcement Learning for Customizable Visual Navigation (Poster)  link » Robotic navigation often requires not only reaching a distant goal, but also satisfying intermediate user preferences on the path, such as obeying the rules of the road or preferring some surfaces over others. Our goal in this paper is to devise a robotic navigation system that can utilize previously collected data to learn navigational strategies that are responsive to user-specified utility functions, such as preferring specific surfaces or staying in sunlight (e.g., to maintain solar power). To this end, we show how offline reinforcement learning can be used to learn reward-specific value functions for long-horizon navigation that can then be composed with planning methods to reach distant goals, while still remaining responsive to user-specified navigational preferences. This approach can utilize large amounts of previously collected data, which is relabeled with the task reward. This makes it possible to incorporate diverse data sources and enable effective generalization in the real world, without any simulation, task-specific data collection, or demonstrations. 
We evaluate our system, ReViND, using a large navigational dataset from prior work, without any data collection specifically for the reward functions that we test. We demonstrate that our system can control a real-world ground robot to navigate to distant goals using only offline training from this dataset, and exhibit behaviors that qualitatively differ based on the user-specified reward function. Link » Dhruv Shah · Arjun Bhorkar · Hrishit Leen · Ilya Kostrikov · Nicholas Rhinehart · Sergey Levine 🔗 - Efficient Planning in a Compact Latent Action Space (Poster)  link » Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision making, so scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes from offline data. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency that is insensitive to growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines. 
Link » zhengyao Jiang · Tianjun Zhang · Michael Janner · Yueying (Lisa) Li · Tim Rocktäschel · Edward Grefenstette · Yuandong Tian 🔗 - User-Interactive Offline Reinforcement Learning (Poster)  link »    Offline reinforcement learning algorithms are still not fully trusted by practitioners due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their arguably most important hyperparameter - the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above mentioned issues simultaneously. Link » Phillip Swazinna · Steffen Udluft · Thomas Runkler 🔗 - Does Zero-Shot Reinforcement Learning Exist? (Poster)  link » A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards controllable agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) (Borsa et al., 2018) or forward-backward (FB) representations (Touati & Ollivier, 2021), but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark (Laskin et al., 2021). 
To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse dynamics, transition models, low-rank transition matrix, contrastive learning, or diversity (APS), perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching $85 \%$ of supervised RL performance with a good replay buffer, in a zero-shot manner. Link » Ahmed Touati · Jérémy Rapin · Yann Ollivier 🔗 - State Advantage Weighting for Offline RL (Poster)  link » We present state advantage weighting for offline reinforcement learning (RL). In contrast to action advantage $A(s,a)$ that we commonly adopt in QSA learning, we leverage state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling the action from values. We expect the agent can get to the high-reward state and the action is determined by how the agent can get to that corresponding state. Experiments on D4RL datasets show that our proposed method can achieve remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online. 
Link » Jiafei Lyu · aicheng Gong · Le Wan · Zongqing Lu · Xiu Li 🔗 - Optimal Transport for Offline Imitation Learning (Poster)  link »    With the advent of large datasets, offline reinforcement learning is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that can assign rewards to offline trajectories, with a few high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration to obtain a similarity measure that can be interpreted as a reward, which can then be used by an offline RL algorithm to learn the policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards. Link » Yicheng Luo · zhengyao Jiang · Samuel Cohen · Edward Grefenstette · Marc Deisenroth 🔗 - Control Graph as Unified IO for Morphology-Task Generalization (Poster)  link »    The rise of generalist large-scale models in natural language and vision has made us expect that a massive data-driven approach could achieve broader generalization in other domains such as continuous control. In this work, we explore a method for learning a single policy that manipulates various forms of agents to solve various tasks by distilling a large amount of proficient behavioral data. 
In order to align the input-output (IO) interface among multiple tasks and diverse agent morphologies while preserving essential 3D geometric relations, we introduce the control graph, which treats observations, actions and goals/tasks in a unified graph representation. We also develop MxT-Bench for fast large-scale behavior generation, which supports procedural generation of diverse morphology-task combinations with a minimal blueprint and hardware-accelerated simulator. Through efficient representation and architecture selection on MxT-Bench, we find that a control graph representation coupled with Transformer architecture improves multi-task performance compared to other baselines including recent discrete tokenization, and provides better prior knowledge for zero-shot transfer or sample efficiency in downstream multi-task imitation learning. Our work suggests that large diverse offline datasets, unified IO representation, and policy representation and architecture selection through supervised learning form a promising approach for studying and advancing morphology-task generalization. Link » Hiroki Furuta · Yusuke Iwasawa · Yutaka Matsuo · Shixiang (Shane) Gu 🔗 - Mutual Information Regularized Offline Reinforcement Learning (Poster)  link » Offline reinforcement learning (RL) aims at learning an effective policy from offline datasets without active interactions with the environment. The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement or making conservative updates for value functions during policy evaluation. 
In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, which reflects how a behavior agent reacts to certain environment states during data collection. To effectively utilize this information to facilitate policy learning, MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. In this way, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding a mutual information regularization. MISA is a general offline RL framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves new state-of-the-art on various tasks of the D4RL benchmark. Link » Xiao Ma · Bingyi Kang · Zhongwen Xu · Min Lin · Shuicheng Yan 🔗 - Uncertainty-Driven Pessimistic Q-Ensemble for Offline-to-Online Reinforcement Learning (Poster)  link » Re-using existing offline reinforcement learning (RL) agents is an emerging topic for reducing the dominant computational cost for exploration in many settings. To effectively fine-tune the pre-trained offline policies, both offline samples and online interactions may be leveraged. In this paper, we propose the idea of incorporating a pessimistic Q-ensemble and an uncertainty quantification technique to effectively fine-tune offline agents. 
To stabilize online Q-function estimates during fine-tuning, the proposed method uses uncertainty estimation as a penalization for a replay buffer with a mixture of online interactions from the ensemble agent and offline samples from the behavioral policies. In various robotic tasks on D4RL benchmark, we show that our method outperforms the state-of-the-art algorithms in terms of the average return and the sample efficiency. Link » Ingook Jang · Seonghyun Kim 🔗 - Offline Robot Reinforcement Learning with Uncertainty-Guided Human Expert Sampling (Poster)  link » Recent advances in batch (offline) reinforcement learning have shown promising results towards learning from available offline data and proved offline RL to be an essential toolkit in learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a sub-optimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline RL algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient when compared to a naive way of combining expert data with data collected from a sub-optimal agent. We augmented an existing offline reinforcement learning algorithm Conservative Q-Learning (CQL) with our approach and performed experiments on data collected from MuJoCo and OffWorld Gym learning environments. 
Link » Ashish Kumar · Ilya Kuzovkin 🔗 - Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation (Poster)  link » We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Link » Dan Qiao · Yu-Xiang Wang 🔗 - Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training (Poster)  link »    Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce Value-Implicit Pre-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. 
VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and real-robot tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, few-shot offline RL on a suite of real-world robot tasks with as few as 20 trajectories. Link » Jason Yecheng Ma · Shagun Sodhani · Dinesh Jayaraman · Osbert Bastani · Vikash Kumar · Amy Zhang 🔗 - Imitation from Observation With Bootstrapped Contrastive Learning (Poster)  link » Imitation from observation is a paradigm that consists of training agents using observations of expert demonstrations without direct access to the actions. 
Depending on the problem configuration, these demonstrations can be sequences of states or raw visual observations. One of the most common procedures adopted to solve this problem is to train a reward function from the demonstrations, but this task remains a significant challenge. We approach this problem with a method of agent behavior representation in a latent space using demonstration videos. Our approach exploits recent contrastive learning algorithms for images and video and uses a bootstrapping method to progressively train a trajectory encoding function with respect to the variation of the agent policy. This function is then used to compute the rewards provided to a standard Reinforcement Learning (RL) algorithm. Our method uses only a limited number of videos produced by an expert and we do not have access to the expert policy function. Our experiments show promising results on a set of continuous control tasks and demonstrate that learning a behavior encoder from videos allows building an efficient reward function for the agent. Link » Medric Sonwa · Johanna Hansen · Eugene Belilovsky 🔗 - Provable Benefits of Representational Transfer in Reinforcement Learning (Poster)  link » We study the problem of representational transfer in RL, where an agent first pretrains offline in a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy online in a target task. We propose a new notion of task relatedness between source and target tasks and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to a set of source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy, with only online access to the target task. 
The sample complexity is close to that of knowing the ground truth features in the target task and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access and validate our findings with empirical evaluation on rich observation MDPs that require deep exploration. Link » Alekh Agarwal · Yuda Song · Kaiwen Wang · Mengdi Wang · Wen Sun · Xuezhou Zhang 🔗 - A Connection between One-Step Regularization and Critic Regularization in Reinforcement Learning (Poster)  link » As with any machine learning problem with limited data, effective offline RL algorithms require careful regularization to avoid overfitting. One-step methods perform regularization by doing just a single step of policy improvement, while critic regularization methods do many steps of policy improvement with a regularized objective. These methods appear distinct. One-step methods, such as advantage-weighted regression and conditional behavioral cloning, truncate policy iteration after just one step. This "early stopping" makes one-step RL simple and stable, but can limit its asymptotic performance. Critic regularization typically requires more compute but has appealing lower-bound guarantees. In this paper, we draw a close connection between these methods: applying a multi-step critic regularization method with a regularization coefficient of 1 yields the same policy as one-step RL. While practical implementations violate our assumptions and critic regularization is typically applied with smaller regularization coefficients, our experiments nevertheless show that our analysis makes accurate, testable predictions about practical offline RL methods (CQL and one-step RL) with commonly-used hyperparameters. Our results do not imply that every problem can be solved with a single step of policy improvement, but rather that one-step RL might be competitive with critic regularization on RL problems that demand strong regularization. 
Link » Benjamin Eysenbach · Matthieu Geist · Sergey Levine · Russ Salakhutdinov 🔗 - Offline evaluation in RL: soft stability weighting to combine fitted Q-learning and model-based methods (Poster)  link » The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data under a different distribution. Because no one method is uniformly best, model selection is important, but difficult without online exploration. We propose soft stability weighting (SSW) for adaptively combining offline estimates from ensembles of fitted-Q-evaluation (FQE) and model-based evaluation methods generated by different random initializations of neural networks. Soft stability weighting computes a state-action-conditional weighted average of the median FQE and model-based prediction by normalizing the state-action-conditional standard deviation of ensembles of both methods relative to the average standard deviation of each method. Therefore it compares the relative stability of predictions in the ensemble to the perturbations from random initializations, drawn from a truncated normal distribution scaled by the input feature size. Link » Briton Park · Xian Wu · Bin Yu · Angela Zhou 🔗 - Using Confounded Data in Offline RL (Poster)  link » In this work we consider the problem of confounding in offline RL, also called the delusion problem. While it is known that learning from purely offline data is a hazardous endeavor in the presence of confounding, in this paper we show that offline, confounded data can be safely combined with online, non-confounded data to improve the sample-efficiency of model-based RL. We import ideas from the well-established framework of $do$-calculus to express model-based RL as a causal inference problem, thus bridging the fields of RL and causality. 
We propose a latent-based method which we prove is correct and efficient, in the sense that it attains better generalization guarantees thanks to the offline, confounded data (in the asymptotic case), regardless of the expert's behavior. We illustrate the effectiveness of our method on a series of synthetic experiments. Link » Maxime Gasse · Damien GRASSET · Guillaume Gaudron · Pierre-Yves Oudeyer 🔗 - Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement (Poster)  link »    Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks. Link » Michael Chang · Alyssa L Dayan · Franziska Meier · Tom Griffiths · Sergey Levine · Amy Zhang 🔗 - Visual Backtracking Teleoperation: A Data Collection Protocol for Offline Image-Based RL (Poster)  link »    We consider how to most efficiently leverage teleoperator time to collect data for learning robust image-based value functions and policies for sparse reward robotic tasks. To accomplish this goal, we modify the process of data collection to include more than just successful demonstrations of the desired task. 
Instead we develop a novel protocol that we call Visual Backtracking Teleoperation (VBT), which deliberately collects a dataset of visually similar failures, recoveries, and successes. VBT data collection is particularly useful for efficiently learning accurate value functions from small datasets of image-based observations. We demonstrate VBT on a real robot to perform continuous control from image observations for the deformable manipulation task of T-shirt grasping. We find that by adjusting the data collection process we improve the quality of both the learned value functions and policies over a variety of baseline methods for data collection. Specifically, we find that offline reinforcement learning on VBT data outperforms standard behavior cloning on successful demonstration data by 13% when both methods are given equal-sized datasets of 60 minutes of data from the real robot. Link » David Brandfonbrener · Stephen Tu · Avi Singh · Stefan Welker · Chad Boodoo · Nikolai Matni · Jake Varley 🔗 - Towards Data-Driven Offline Simulations for Online Reinforcement Learning (Poster)  link » Modern decision-making systems, from robots to web recommendation engines, are expected to adapt: to user preferences, changing circumstances or even new tasks. Yet, it is still uncommon to deploy a dynamically learning agent (rather than a fixed policy) to a production system, as it's perceived as unsafe. Using historical data to reason about learning algorithms, similar to offline policy evaluation (OPE) applied to fixed policies, could help practitioners evaluate and ultimately deploy such adaptive agents to production. In this work, we formalize offline learner simulation (OLS) for reinforcement learning (RL) and propose a novel evaluation protocol that measures both fidelity and efficiency. For environments with complex high-dimensional observations, we propose a semi-parametric approach that leverages recent advances in latent state discovery. 
In preliminary experiments, we show the advantage of our approach compared to fully non-parametric baselines. Link » Shengpu Tang · Felipe Vieira Frujeri · Dipendra Misra · Alex Lamb · John Langford · Paul Mineiro · Sebastian Kochman 🔗 - Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction (Poster)  link » We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of an evaluation policy, $\pi_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more policies that may be different from $\pi_e$. Current OPE algorithms may produce poor OPE estimates under policy distribution shift, i.e., when the probability of a particular state-action pair occurring under $\pi_e$ is very different from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of OPE estimation by projecting the ground state-space into a lower-dimensional state-space using concepts from the state abstraction literature in RL. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms, which compute distribution correction ratios to produce their OPE estimate. In the original state-space, these ratios may have high variance, which may lead to high-variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower-variance OPE. We then present a minimax optimization problem that incorporates the state abstraction. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios. 
Link » Brahma Pavse · Josiah Hanna 🔗 - Benchmarking Offline Reinforcement Learning Algorithms for E-Commerce Order Fraud Evaluation (Poster)  link » Amazon and other e-commerce sites must employ mechanisms to protect their millions of customers from fraud, such as unauthorized use of credit cards. One such mechanism is order fraud evaluation, where systems evaluate orders for fraud risk, and either “pass” the order, or take an action to mitigate high risk. Order fraud evaluation systems typically use binary classification models that distinguish fraudulent and legitimate orders, to assess risk and take action. We seek to devise a system that considers both financial losses of fraud and long-term customer satisfaction, which may be impaired when incorrect actions are applied to legitimate customers. We propose that taking actions to optimize long-term impact can be formulated as a Reinforcement Learning (RL) problem. Standard RL methods require online interaction with an environment to learn, but this is not desirable in high-stakes applications like order fraud evaluation. Offline RL algorithms learn from logged data collected from the environment, without the need for online interaction, making them suitable for our use case. We show that offline RL methods outperform traditional binary classification solutions in SimStore, a simplified e-commerce simulation that incorporates order fraud risk. We also propose a novel approach to training offline RL policies that adds a new loss term during training, to better align policy exploration with taking correct actions. 
Link » Soysal Degirmenci · Christopher S Jones 🔗 - Sparse Q-Learning: Offline Reinforcement Learning with Implicit Value Regularization (Poster)  link » Most offline reinforcement learning (RL) methods face a trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit its deviation from the behavior policy, because computing Q-values using out-of-distribution actions will suffer from errors due to distributional shift. The recently proposed In-sample Learning paradigm (e.g., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose a practical algorithm, which uses the same value regularization as CQL, but in a completely in-sample manner. Compared with IQL, we find that our algorithm introduces sparsity in learning the value function; we thus dub our method Sparse Q-learning (SQL). We verify the effectiveness of SQL on D4RL benchmark datasets. We also show the benefits of sparsity by comparing SQL with IQL in noisy data regimes and show the robustness of in-sample learning by comparing SQL with CQL in small data regimes. Under all settings, SQL achieves better results and converges faster than the other baselines. Link » Haoran Xu · Li Jiang · Li Jianxiong · Zhuoran Yang · Zhaoran Wang · Xianyuan Zhan 🔗
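
The in-sample idea mentioned in the Sparse Q-Learning abstract above can be illustrated with a small sketch. This is not the SQL algorithm and is not taken from the paper: it only shows an asymmetric squared (expectile) loss of the kind associated with IQL, applied to a single scalar value estimate that is fitted to Q-values of actions present in the dataset, so no unseen action is ever queried. The function names and the parameters `tau`, `lr`, and `steps` are hypothetical and chosen for illustration.

```python
def expectile_loss(q_values, v, tau=0.7):
    """Mean of |tau - 1(u < 0)| * u^2 with u = q - v (illustrative only)."""
    total = 0.0
    for q in q_values:
        u = q - v
        # Under-estimates of q (u >= 0) are weighted by tau, over-estimates by 1 - tau.
        weight = tau if u >= 0.0 else 1.0 - tau
        total += weight * u * u
    return total / len(q_values)


def fit_expectile(q_values, tau=0.7, lr=0.1, steps=2000):
    """Fit one scalar v to dataset Q-values by gradient descent on expectile_loss."""
    v = sum(q_values) / len(q_values)  # start from the plain mean
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            u = q - v
            weight = tau if u >= 0.0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * u^2 with respect to v
        v -= lr * grad / len(q_values)
    return v


if __name__ == "__main__":
    dataset_q = [0.0, 1.0, 2.0, 3.0]  # hypothetical Q-values of in-dataset actions
    print(fit_expectile(dataset_q, tau=0.5))  # symmetric case: stays at the mean
    print(fit_expectile(dataset_q, tau=0.9))  # pushed towards the larger Q-values
```

With `tau = 0.5` the loss is symmetric and the fitted value stays at the mean of the dataset Q-values; larger `tau` moves it towards the best in-dataset values while still using only actions that appear in the data.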

#### Author Information

##### Rishabh Agarwal (Google Research, Brain Team)

My research work mainly revolves around deep reinforcement learning (RL), often with the goal of making RL methods suitable for real-world problems, and includes an outstanding paper award at NeurIPS.