Offline reinforcement learning (RL) is a re-emerging area of study that aims to learn behaviors using only logged data, such as data from previous experiments or human demonstrations, without further environment interaction. It has the potential to make tremendous progress in a number of real-world decision-making problems where active data collection is expensive (e.g., in robotics, drug discovery, dialogue generation, recommendation systems) or unsafe/dangerous (e.g., healthcare, autonomous driving, or education). Such a paradigm promises to resolve a key challenge in bringing reinforcement learning algorithms out of constrained lab settings and into the real world. The first edition of the offline RL workshop, held at NeurIPS 2020, focused on and led to algorithmic development in offline RL. This year we propose to shift the focus from algorithm design to bridging the gap between offline RL research and real-world offline RL. Our aim is to create a space for discussion between researchers and practitioners on topics of importance for enabling offline RL methods in the real world. To that end, we have revised the topics and themes of the workshop and invited new speakers working on application-focused areas. Building on last year's lively panel discussion, we have also invited last year's panelists to participate in a retrospective panel on their changing perspectives.
For details on submission please visit: https://offlinerlneurips.github.io/2021 (Submission deadline: October 6, Anywhere on Earth)
Speakers:
Aviv Tamar (Technion - Israel Inst. of Technology)
Angela Schoellig (University of Toronto)
Barbara Engelhardt (Princeton University)
Sham Kakade (University of Washington/Microsoft)
Minmin Chen (Google)
Philip S. Thomas (UMass Amherst)
Tue 9:00 a.m. - 9:10 a.m.

Opening Remarks
Rishabh Agarwal · Aviral Kumar
Tue 9:10 a.m. - 9:40 a.m.

Learning to Explore From Data
(Talk)
Aviv Tamar
Tue 9:40 a.m. - 9:45 a.m.

Q&A for Aviv Tamar
(Q&A)
Aviv Tamar
Tue 9:45 a.m. - 9:55 a.m.

Contributed Talk 1: What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
(Talk)
Ajay Mandlekar
Tue 10:00 a.m. - 10:10 a.m.

Contributed Talk 2: What Would the Expert do?: Causal Imitation Learning
(Talk)
Gokul Swamy
Tue 10:15 a.m. - 10:25 a.m.

Contributed Talk 3: Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation
(Talk)
Yunzong Xu · Akshay Krishnamurthy · David Simchi-Levi
Tue 10:30 a.m. - 10:40 a.m.

Contributed Talk 4: PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning
(Talk)
Luckeciano Carvalho Melo
Tue 10:40 a.m. - 11:45 a.m.

Poster Session 1
(Poster Session)
https://eventhosts.gather.town/app/cIcBclk1rC3IuihY/neuripstemplate8
Tue 11:45 a.m. - 11:46 a.m.

Speaker Intro
(Speaker Introduction)
Rishabh Agarwal · Aviral Kumar
Tue 11:46 a.m. - 12:16 p.m.

Offline RL for Robotics
(Talk)
Angela Schoellig
Tue 12:16 p.m. - 12:21 p.m.

Q&A for Angela Schoellig
(Q&A)
Tue 12:21 p.m. - 12:22 p.m.

Speaker Intro
(Live short intro)
Rishabh Agarwal · Aviral Kumar
Tue 12:22 p.m. - 12:52 p.m.

Generalization theory in Offline RL
(Talk)
Sham Kakade
Tue 12:52 p.m. - 12:57 p.m.

Q&A for Sham Kakade
(Q&A)
Sham Kakade
Tue 1:00 p.m. - 2:00 p.m.

Invited Speaker Panel
(Discussion Panel)
Sham Kakade · Minmin Chen · Philip Thomas · Angela Schoellig · Barbara Engelhardt · Doina Precup · George Tucker
Tue 2:00 p.m. - 3:00 p.m.

Retrospective Panel
(Discussion Panel)
Sergey Levine · Nando de Freitas · Emma Brunskill · Finale Doshi-Velez · Nan Jiang · Rishabh Agarwal
Tue 3:00 p.m. - 3:01 p.m.

Speaker Intro
Aviral Kumar · George Tucker
Tue 3:01 p.m. - 3:31 p.m.

Offline RL for recommendation systems
(Talk)
Minmin Chen
Tue 3:31 p.m. - 3:36 p.m.

Q&A for Minmin Chen
(Q&A)
Minmin Chen
Tue 4:06 p.m. - 4:07 p.m.

Speaker Intro
Aviral Kumar · George Tucker
Tue 4:07 p.m. - 4:37 p.m.

Offline Reinforcement Learning for Hospital Patients When Every Patient is Different
(Talk)
Barbara Engelhardt
Tue 4:37 p.m. - 4:42 p.m.

Q&A for Barbara Engelhardt
(Q&A)
Tue 4:42 p.m. - 4:43 p.m.

Speaker Intro
(Introduction)
Tue 4:43 p.m. - 5:13 p.m.

Advances in (High-Confidence) Off-Policy Evaluation
(Talk)
Philip Thomas
Tue 5:13 p.m. - 5:19 p.m.

Q&A for Philip Thomas
(Q&A)
Philip Thomas
Tue 5:19 p.m. - 5:20 p.m.

Closing Remarks & Poster Session
(Closing Remarks)
Tue 5:20 p.m. - 6:20 p.m.

Poster Session 2
(Poster Session)
https://eventhosts.gather.town/app/cIcBclk1rC3IuihY/neuripstemplate8


Offline Reinforcement Learning with Soft Behavior Regularization
(Poster)
Most prior approaches to offline reinforcement learning (RL) use behavior regularization, typically augmenting existing off-policy actor-critic algorithms with a penalty measuring divergence between the policy and the offline data. However, these approaches lack guaranteed performance improvement over the behavior policy. In this work, starting from the performance difference between the learned policy and the behavior policy, we derive a new policy learning objective that can be used in the offline setting and corresponds to the advantage function value of the behavior policy multiplied by a state-marginal density ratio. We propose a practical way to compute the density ratio and demonstrate its equivalence to a state-dependent behavior regularization. Unlike the state-independent regularization used in prior approaches, this soft regularization allows more freedom of policy deviation at high-confidence states, leading to better performance and stability. We thus term our resulting algorithm Soft Behavior-regularized Actor Critic (SBAC). Our experimental results show that SBAC matches or outperforms the state of the art on a set of continuous control locomotion and manipulation tasks.
Haoran Xu · Xianyuan Zhan · Li Jianxiong · Honglei Yin


Instance-dependent Offline Reinforcement Learning: From tabular RL to linear MDPs
(Poster)
We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works derive information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed in terms of covering coefficients that lack an explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches equation (1). We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
Ming Yin · Yu-Xiang Wang


DCUR: Data Curriculum for Teaching via Samples with Reinforcement Learning
(Poster)
Deep reinforcement learning (RL) has shown great empirical successes, but suffers from brittleness and sample inefficiency. A potential remedy is to use a previously trained policy as a source of supervision. In this work, we refer to these policies as teachers and study how to transfer their expertise to new student policies by focusing on data usage. We propose a framework, Data CUrriculum for Reinforcement learning (DCUR), which first trains teachers using online deep RL and stores the logged environment interaction history. Students then learn either by running offline RL or by using teacher data in combination with a small amount of self-generated data. DCUR's central idea involves defining a class of data curricula which, as a function of training time, limits the student to sampling from a fixed subset of the full teacher data. We test teachers and students using state-of-the-art deep RL algorithms across a variety of data curricula. Results suggest that the choice of data curricula significantly impacts student learning, and that it is beneficial to limit the data during early training stages while gradually letting the data availability grow over time. We identify when the student can learn offline and match teacher performance without relying on specialized offline RL algorithms. Furthermore, we show that collecting a small fraction of online data provides complementary benefits with the data curriculum. Supplementary material is available at https://sites.google.com/view/anondcur/.
Daniel Seita · Abhinav Gopal · Mandi Zhao · John Canny
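The data-curriculum idea in the DCUR abstract can be illustrated with a minimal sketch. The abstract does not specify the schedule, so the linear growth rule below, along with the names `available_prefix` and `sample_batch`, are assumptions made purely for illustration and are not the authors' implementation. The sketch restricts the student to a prefix of the teacher's logged data and lets that prefix grow with training time.

```python
import random


def available_prefix(step, total_steps, buffer_size):
    """Number of teacher samples unlocked at a given student training step.

    Hypothetical linear schedule (an assumption for illustration): the
    unlocked prefix grows in proportion to training progress and is capped
    at the full buffer size.
    """
    unlocked = buffer_size * (step + 1) // total_steps  # integer arithmetic
    return max(1, min(buffer_size, unlocked))


def sample_batch(teacher_buffer, step, total_steps, batch_size, rng):
    """Sample uniformly, with replacement, from the currently unlocked prefix."""
    limit = available_prefix(step, total_steps, len(teacher_buffer))
    return [teacher_buffer[rng.randrange(limit)] for _ in range(batch_size)]


# Toy teacher log: integers stand in for logged transitions, in collection order.
teacher_buffer = list(range(1000))
rng = random.Random(0)

# Early in training only the first 10% of the teacher data is visible.
early_batch = sample_batch(teacher_buffer, step=99, total_steps=1000,
                           batch_size=32, rng=rng)
# At the end of training the whole buffer is available.
late_batch = sample_batch(teacher_buffer, step=999, total_steps=1000,
                          batch_size=32, rng=rng)
```

Other schedules (for example, a fixed-size sliding window) fit the same interface by changing only `available_prefix`.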


What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
(Poster)
Imitating human demonstrations is a promising approach to endow robots with various manipulation capabilities. While recent advances have been made in imitation learning and batch (offline) reinforcement learning, a lack of open-source human datasets and reproducible learning methods makes assessing the state of the field difficult. In this paper, we conduct an extensive study of six offline learning algorithms for robot manipulation on five simulated and three real-world multi-stage manipulation tasks of varying complexity, and with datasets of varying quality. Our study analyzes the most critical challenges when learning from offline human data for manipulation. Based on the study, we derive a series of lessons including the sensitivity to different algorithmic design choices, the dependence on the quality of the demonstrations, and the variability based on the stopping criteria due to the different objectives in training and evaluation. We also highlight opportunities for learning from human datasets, such as the ability to learn proficient policies on challenging, multi-stage tasks beyond the scope of current reinforcement learning methods, and the ability to easily scale to natural, real-world manipulation scenarios where only raw sensory signals are available. Upon acceptance, we will open-source our datasets and all algorithm implementations to facilitate future research and fair comparisons in learning from human demonstration data. Additional results and videos at https://sites.google.com/view/offlinedemostudy
Ajay Mandlekar · Danfei Xu · Josiah Wong · Chen Wang · Li Fei-Fei · Silvio Savarese · Yuke Zhu · Roberto Martín-Martín


TiKick: Toward Playing Multi-agent Football Full Games from Single-agent Demonstrations
(Poster)
Deep reinforcement learning (DRL) has achieved superhuman performance on complex video games (e.g., StarCraft II and Dota II). However, current DRL systems still suffer from challenges of multi-agent coordination, sparse rewards, stochastic environments, etc. To address these challenges, we use a football video game, Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generate a large replay dataset from the self-play of single-agent experts, which are obtained from league training. We then develop a new offline algorithm to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of modern multi-agent algorithms and that our method achieves state-of-the-art performance on various academic scenarios.
Shiyu Huang · Wenze Chen · Longfei Zhang · Shizhen Xu · Ziyang Li · Fengming Zhu · Deheng Ye · Ting Chen · Jun Zhu


d3rlpy: An Offline Deep Reinforcement Learning Library
(Poster)
In this paper, we introduce d3rlpy, an open-source offline deep reinforcement learning (RL) library for Python. d3rlpy supports a number of offline deep RL algorithms as well as online algorithms via a user-friendly API. To assist deep RL research and development projects, d3rlpy provides practical and unique features such as data collection, exporting policies for deployment, preprocessing and postprocessing, distributional Q-functions, multi-step learning, and a convenient command-line interface. Furthermore, d3rlpy provides a novel graphical interface that enables users to train offline RL algorithms without writing code. Lastly, the implemented algorithms are benchmarked with D4RL datasets to ensure the implementation quality.
Takuma Seno · Michita Imai


PulseRL: Enabling Offline Reinforcement Learning for Digital Marketing Systems via Conservative Q-Learning
(Poster)
Digital Marketing Systems (DMS) are the primary point of contact between a digital business and its customers. In this context, the communication channel optimization problem poses a valuable and still open challenge for DMS. Due to its interactive nature, Reinforcement Learning (RL) appears as a promising formulation for this problem. However, the standard RL setting learns from interacting with the environment, which is costly and dangerous for production systems. Furthermore, it also fails to learn from historical interactions due to the distributional shift between the collection and learning policies. To address this, we present PulseRL, an offline RL-based production system for communication channel optimization built upon the Conservative Q-Learning (CQL) framework. The PulseRL architecture comprises the whole engineering pipeline (data processing, training, deployment, and monitoring), scaling to handle millions of users. Using CQL, PulseRL learns from historical logs, and its learning objective reduces the shift problem by mitigating the overestimation bias from out-of-distribution actions. We conducted experiments in a real-world DMS. Results show that PulseRL surpasses RL baselines by a significant margin in the online evaluation. They also validate the theoretical properties of CQL in a complex scenario with high sampling error and non-linear function approximation.
Luckeciano Carvalho Melo · Luana G B Martins · Bryan Lincoln de Oliveira · Bruno Brandão · Douglas Winston Soares · Telma Lima


Latent Geodesics of Model Dynamics for Offline Reinforcement Learning
(Poster)
Model-based offline reinforcement learning approaches generally rely on bounds of model error. While contemporary methods achieve such bounds through an ensemble of models, we propose to estimate them using a data-driven latent metric. In particular, we build upon recent advances in Riemannian geometry of generative models to construct a latent metric of an encoder-decoder-based forward model. Our proposed metric measures both the quality of out-of-distribution samples and the discrepancy of examples in the data. We show that our metric can be viewed as a combination of two metrics, one relating to proximity and the other to epistemic uncertainty. Finally, we leverage our metric in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline reinforcement learning benchmarks.
Guy Tennenholtz · Nir Baram · Shie Mannor


Domain Knowledge Guided Offline Q Learning
(Poster)
Offline reinforcement learning (RL) is a promising method for applications where direct exploration is not possible but a decent initial model is expected for the online stage. In practice, offline RL can underperform because of overestimation attributed to distributional shift between the training data and the learned policy. A common approach to mitigating this issue is to constrain the learned policies so that they remain close to the fixed batch of interactions. This method is typically used without considering the application context. However, domain knowledge is available in many real-world cases and may be utilized to effectively handle the issue of out-of-distribution actions. Incorporating domain knowledge in training avoids additional function approximation to estimate the behavior policy and results in easy-to-interpret policies. To encourage the adoption of offline RL in practical applications, we propose Domain Knowledge guided Q learning (DKQ). We show that DKQ is a conservative approach, where the unique fixed point still exists and is upper bounded by the standard optimal Q function. DKQ also leads to a lower chance of overestimation. In addition, we demonstrate the benefit of DKQ empirically via a novel, real-world case study, guided family tree building, which appears to be the first application of offline RL in genealogy. The results show that, guided by proper domain knowledge, DKQ can achieve offline performance similar to standard Q learning and is better aligned with the behavior policy revealed from the data, indicating a lower risk of overestimation on unseen actions. Further, we demonstrate the efficiency and flexibility of DKQ with a classical control problem.
Xiaoxuan Zhang · Sijia Zhang · Yen-Yun Yu


Understanding the Effects of Dataset Characteristics on Offline Reinforcement Learning
(Poster)
In the real world, acting on the environment with a weak policy can be expensive or very risky, which hampers real-world applications of reinforcement learning. Offline Reinforcement Learning (RL) can learn policies from a given dataset without interacting with the environment. However, the dataset is the only source of information for an Offline RL algorithm and determines the performance of the learned policy. We still lack studies on how dataset characteristics influence different Offline RL algorithms. Therefore, we conducted a comprehensive empirical analysis of how dataset characteristics affect the performance of Offline RL algorithms for discrete action environments. A dataset is characterized by two metrics: (1) the Trajectory Quality (TQ), measured by the average dataset return, and (2) the State-Action Coverage (SACo), measured by the number of unique state-action pairs. We found that variants of the off-policy Deep Q-Network family require datasets with high SACo to perform well. Algorithms that constrain the learned policy towards the given dataset perform well for datasets with high TQ or SACo. For datasets with high TQ, Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms.
Kajetan Schweighofer · Markus Hofmarcher · Marius-Constantin Dinu · Philipp Renz · Angela Bitto · Vihang Patil · Sepp Hochreiter
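The two dataset metrics named in the abstract above can be sketched in a few lines. The abstract defines TQ as the average dataset return and SACo as the number of unique state-action pairs. The data layout (a list of trajectories, each a list of `(state, action, reward)` tuples with hashable states) and the function names are assumptions for illustration, and any normalization the authors may apply is not shown.

```python
def trajectory_quality(trajectories):
    """Average undiscounted return over all trajectories in the dataset."""
    returns = [sum(reward for _, _, reward in traj) for traj in trajectories]
    return sum(returns) / len(returns)


def state_action_coverage(trajectories):
    """Number of distinct (state, action) pairs appearing in the dataset."""
    return len({(state, action) for traj in trajectories for state, action, _ in traj})


# Toy dataset with discrete states and actions (hypothetical values).
dataset = [
    [(0, 1, 0.0), (1, 0, 1.0)],
    [(0, 1, 0.0), (2, 1, 0.0), (1, 0, 1.0)],
]

tq = trajectory_quality(dataset)       # both trajectories have return 1.0
saco = state_action_coverage(dataset)  # pairs: (0, 1), (1, 0), (2, 1)
```

Counting unique pairs this way only makes sense for discrete (or discretized) states and actions, which matches the discrete-action setting the abstract describes.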


Unsupervised Learning of Temporal Abstractions using Slot-based Transformers
(Poster)
The discovery of reusable subroutines simplifies decision-making and planning in complex reinforcement learning problems. Previous approaches propose to learn such temporal abstractions in a purely unsupervised fashion through observing state-action trajectories gathered from executing a policy. However, a current limitation is that they process each trajectory in an entirely sequential manner, which prevents them from revising earlier decisions about subroutine boundary points in light of new incoming information. In this work we propose SloTTAr, a fully parallel approach that integrates sequence processing Transformers with a Slot Attention module for learning about subroutines in an unsupervised fashion. We demonstrate how SloTTAr is capable of outperforming strong baselines in terms of boundary point discovery, while being up to 30x faster on existing benchmarks.
Anand Gopalakrishnan · Kazuki Irie · Jürgen Schmidhuber · Sjoerd van Steenkiste


Counter-Strike Deathmatch with Large-Scale Behavioural Cloning
(Poster)
This paper describes an AI agent that plays the modern first-person shooter (FPS) video game 'Counter-Strike: Global Offensive' (CSGO) from pixel input. The agent, a deep neural network, matches the performance of the medium difficulty built-in AI on the deathmatch game mode whilst adopting a human-like play style. Previous research has mostly focused on games with convenient APIs and low-resolution graphics, allowing them to be run cheaply at scale. This is not the case for CSGO, with system requirements 100$\times$ that of previously studied FPS games. This limits the quantity of on-policy data that can be generated, precluding many reinforcement learning algorithms. Our solution uses behavioural cloning: training on a large noisy dataset scraped from human play on online servers (5.5 million frames or 95 hours), and on smaller datasets of clean expert demonstrations. This scale is an order of magnitude larger than prior work on imitation learning in FPS games. To introduce this challenging environment to the AI community, we open source code and datasets.
Tim Pearce · Jun Zhu


Modern Hopfield Networks for Return Decomposition for Delayed Rewards
(Poster)
Delayed rewards, which are separated from their causative actions by irrelevant actions, hamper learning in reinforcement learning (RL). Real-world problems in particular often contain such delayed and sparse rewards. Recently, return decomposition for delayed rewards (RUDDER) employed pattern recognition to remove or reduce delay in rewards, which dramatically simplifies the learning task of the underlying RL method. RUDDER was realized using a long short-term memory (LSTM). The LSTM was trained to identify important state-action pair patterns responsible for the return. Reward was then redistributed to these important state-action pairs. However, training the LSTM is often difficult and requires a large number of episodes. In this work, we replace the LSTM with the recently proposed continuous modern Hopfield networks (MHN) and introduce Hopfield-RUDDER. MHN are powerful trainable associative memories with large storage capacity. They require only a few training samples and excel at identifying and recognizing patterns. We use this property of MHN to identify important state-action pairs that are associated with low or high return episodes and directly redistribute reward to them. However, in partially observable environments, Hopfield-RUDDER requires additional information about the history of state-action pairs. Therefore, we evaluate several methods for compressing history and introduce reset-max history, a lightweight history compression using the max-operator in combination with a reset gate. We experimentally show that Hopfield-RUDDER is able to outperform LSTM-based RUDDER on various 1D environments with small numbers of episodes. Finally, we show in preliminary experiments that Hopfield-RUDDER scales to highly complex environments with the Minecraft ObtainDiamond task from the MineRL NeurIPS challenge.
Michael Widrich · Markus Hofmarcher · Vihang Patil · Angela Bitto · Sepp Hochreiter


Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
(Poster)
We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the models to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where the additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning, where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP, where the partial coverage condition is defined using density-ratio-based concentrability coefficients associated with individual factors.
Masatoshi Uehara · Wen Sun


Importance of Representation Learning for Off-Policy Fitted Q-Evaluation
(Poster)
The goal of offline policy evaluation (OPE) is to evaluate target policies based on logged data with a possibly much different distribution. One of the most popular empirical approaches to OPE is fitted Q-evaluation (FQE). With linear function approximation, several works have found that FQE (and other OPE methods) exhibit exponential error amplification in the problem horizon, except under very strong assumptions. Given the empirical success of deep FQE, in this work we examine the effect of implicit regularization through deep architectures and loss functions on the divergence and performance of FQE. We find that divergence does occur with simple feed-forward architectures, but can be mitigated using various architectures and algorithmic techniques, such as ResNet architectures, learning a shared representation between multiple target policies, and hypermodels. Our results suggest interesting directions for future work, including analyzing the effect of architecture on the stability of fixed-point updates, which are ubiquitous in modern reinforcement learning.
Carrie Wu · Nevena Lazic · Dong Yin · Cosmin Paduraru


Offline Contextual Bandits for Wireless Network Optimization
(Poster)
The explosion in mobile data traffic, together with the ever-increasing expectations for higher quality of service, calls for the development of new AI algorithms for wireless network optimization. In this paper, we investigate how to learn policies that can automatically adjust the configuration parameters of every cell in the network in response to changes in user demand. Our solution combines existing methods for offline learning and adapts them in a principled way to overcome crucial challenges arising in this context. Empirical results suggest that our proposed method will achieve important performance gains when deployed in the real network while satisfying practical constraints on computational efficiency.
Miguel Suau de Castro


Robust On-Policy Data Collection for Data-Efficient Policy Evaluation
(Poster)
This paper considers how to complement offline reinforcement learning (RL) data with additional data collection for the task of policy evaluation. In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest. Prior work on offline policy evaluation typically only considers a static dataset. We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset. We show that simply running the evaluation policy (on-policy data collection) is suboptimal for this setting. We then introduce two new data collection strategies for policy evaluation, both of which consider previously collected data when collecting future data so as to reduce distribution shift (or sampling error) in the entire dataset collected. Our empirical results show that, compared to on-policy sampling, our strategies produce data with lower sampling error and generally lead to lower mean-squared error in policy evaluation for any total dataset size. We also show that these strategies can start from initial off-policy data, collect additional data, and then use both the initial and new data to produce low mean-squared error policy evaluation without using off-policy corrections.
Rujie Zhong · Josiah Hanna · Lukas Schäfer · Stefano Albrecht


Doubly Pessimistic Algorithms for Strictly Safe Off-Policy Optimization
(Poster)
Safety in reinforcement learning (RL) has become increasingly important in recent years. Yet, many existing solutions fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems. In this paper, we study offline RL in the presence of safety requirements: from a dataset collected a priori and without direct access to the true environment, learn an optimal policy that is guaranteed to respect the safety constraints. We first address this problem by modeling the safety requirement as an unknown cost function of states and actions, whose expected value with respect to the policy must fall below a certain threshold. We then present an algorithm in the context of finite-horizon Markov decision processes (MDPs), termed Safe-DPVI, that acts in a doubly pessimistic manner: 1) when it constructs a conservative set of safe policies; and 2) when it selects a good policy from that conservative set. Without assuming sufficient coverage of the dataset or any structure for the underlying MDPs, we establish a data-dependent upper bound on the suboptimality gap of the safe policy Safe-DPVI returns. We then specialize our results to linear MDPs with appropriate assumptions on the dataset being well-explored. Both the data-dependent and specialized upper bounds nearly match those of state-of-the-art unsafe offline RL algorithms, with an additional multiplicative factor $\frac{\sum_{h=1}^H\alpha_{h}}{H}$, where $\alpha_h$ characterizes the safety constraint at time-step $h$. We further present numerical simulations that corroborate our theoretical findings.
Sanae Amani · Lin Yang


Offline RL With Resource Constrained Online Deployment
(Poster)
Offline reinforcement learning is used to train policies in scenarios where real-time access to the environment is expensive or impossible. As a natural consequence of these harsh conditions, an agent may lack the resources to fully observe the online environment before taking an action. We dub this situation the resource-constrained setting. This leads to situations where the offline dataset (available for training) can contain fully processed features (using powerful language models, image models, complex sensors, etc.) which are not available when actions are actually taken online. This disconnect leads to an interesting and unexplored problem in offline RL: is it possible to use a richly processed offline dataset to train a policy which has access to fewer features in the online environment? In this work, we introduce and formalize this novel resource-constrained problem setting. We highlight the performance gap between policies trained using the full offline dataset and policies trained using limited features. We address this performance gap with a policy transfer algorithm which first trains a teacher agent using the offline dataset where features are fully available, and then transfers this knowledge to a student agent that only uses the resource-constrained features. To better capture the challenge of this setting, we propose a data collection procedure: Resource Constrained-Datasets for RL (RC-D4RL). We evaluate our transfer algorithm on RC-D4RL and the popular D4RL benchmarks and observe consistent improvement over the baseline (TD3+BC without transfer).
Jayanth Reddy Regatti · Aniket Anand Deshmukh · Young Jung · Abhishek Gupta · Urun Dogan


Personalization for Webbased Services using Offline Reinforcement Learning
(Poster)
Large-scale Web-based services present opportunities for improving UI policies based on observed user interactions. We investigate both the sequential and non-sequential formulations, highlighting their benefits and drawbacks. In the sequential setting, we address challenges of learning such policies through model-free offline Reinforcement Learning (RL) with off-policy training. Deployed in a production system for user authentication in a major social network, it significantly improves long-term objectives. We articulate practical challenges, compare several ML techniques, provide insights on training and evaluation of RL models, and discuss generalizations.
Pavlos A Apostolopoulos · Zehui Wang · Hanson Wang · Chad Zhou · Kittipat Virochsiri · Norm Zhou · Igor Markov 🔗 


Offline Reinforcement Learning with Implicit Q-Learning
(Poster)
Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This tradeoff is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose a new offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function, without any explicit policy. Then, we extract the policy via advantage-weighted behavioral cloning, which also avoids querying out-of-sample actions. We dub our method implicit Q-learning (IQL). IQL is easy to implement, computationally efficient, and only requires fitting an additional critic with an asymmetric L2 loss.
Ilya Kostrikov · Ashvin Nair · Sergey Levine 🔗 
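
The asymmetric L2 (expectile) loss mentioned in the abstract above can be illustrated with a minimal sketch. The function names, the scalar setting, and the gradient-descent fit are assumptions made for illustration; this is not the authors' implementation, which fits neural networks rather than a single scalar.

```python
def expectile_loss(residuals, tau=0.7):
    """Asymmetric L2 loss: mean of |tau - 1(u < 0)| * u**2 over residuals u."""
    total = 0.0
    for u in residuals:
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(residuals)


def fit_expectile(samples, tau=0.7, lr=0.1, steps=2000):
    """Find the scalar v that minimizes the expectile loss of (x - v).

    Illustrative only: plain gradient descent on a single number.
    """
    v = sum(samples) / len(samples)
    for _ in range(steps):
        grad = 0.0
        for x in samples:
            u = x - v
            weight = tau if u > 0 else 1.0 - tau
            grad += -2.0 * weight * u  # derivative of weight * u**2 w.r.t. v
        v -= lr * grad / len(samples)
    return v
```

With `tau=0.5` the fitted value is the ordinary mean; pushing `tau` toward 1 moves it toward the largest sample, which is how an upper expectile can stand in for the value of the best actions seen in the data.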


Pessimistic Model Selection for Offline Deep Reinforcement Learning
(Poster)
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the overfitting issue that leads to poor generalizability of the policy learned by DRL.
Huck Yang · Yifan Cui · Pin-Yu Chen 🔗 


BATS: Best Action Trajectory Stitching
(Poster)
The problem of offline reinforcement learning focuses on learning a good policy from a log of environment interactions. Past efforts for developing algorithms in this area have revolved around introducing constraints to online reinforcement algorithms to ensure the actions of the learned policy are constrained to the logged data. In this work, we explore an alternative approach by planning on the fixed dataset directly. Specifically, we introduce an algorithm which forms a tabular Markov Decision Process (MDP) over the logged data by adding new transitions to the dataset. We do this by using learned dynamics models to plan short trajectories between states. Since exact value iteration can be performed on this constructed MDP, it becomes easy to identify which trajectories are advantageous to add to the MDP. Crucially, since most transitions in this MDP come from the logged data, trajectories from the MDP can be rolled out for long periods with confidence. We prove that this property allows one to make upper and lower bounds on the value function up to appropriate distance metrics. Finally, we demonstrate empirically how algorithms that uniformly constrain the learned policy to the entire dataset can result in unwanted behavior, and we show an example in which simply behavior cloning the optimal policy of the MDP created by our algorithm avoids this problem. 
Ian Char · Viraj Mehta · Adam Villaflor · John Dolan · Jeff Schneider 🔗 
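
The idea of running exact value iteration on a tabular MDP built from logged transitions, then checking whether a stitched transition helps, can be sketched as follows. The dictionary layout, the toy states, and the stitched edge are hypothetical; the sketch assumes deterministic transitions and does not reproduce the BATS algorithm itself.

```python
def value_iteration(edges, gamma=0.99, tol=1e-10, max_sweeps=10_000):
    """Exact value iteration on a small deterministic tabular MDP.

    edges maps each state to a list of (reward, next_state) pairs;
    next_state is None when the transition terminates the episode.
    """
    values = {state: 0.0 for state in edges}
    for _ in range(max_sweeps):
        delta = 0.0
        for state, out_edges in edges.items():
            if not out_edges:
                continue
            best = max(
                reward + (gamma * values.get(nxt, 0.0) if nxt is not None else 0.0)
                for reward, nxt in out_edges
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    return values


# Toy logged data: two separate trajectories, A -> D -> end and B -> end.
logged = {"A": [(0.0, "D")], "D": [(0.1, None)], "B": [(1.0, None)]}
# Hypothetical stitched transition A -> B, as a learned dynamics model might propose.
stitched = {**logged, "A": logged["A"] + [(0.0, "B")]}
```

Comparing `value_iteration(logged)["A"]` with `value_iteration(stitched)["A"]` shows whether the added transition is advantageous for state A.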


Single-Shot Pruning for Offline Reinforcement Learning
(Poster)
Deep Reinforcement Learning (RL) is a powerful framework for solving complex real-world problems. Large neural networks employed in the framework are traditionally associated with better generalization capabilities, but their increased size entails the drawbacks of extensive training duration, substantial hardware resources, and longer inference times. One way to tackle this problem is to prune neural networks, leaving only the necessary parameters. State-of-the-art concurrent pruning techniques for imposing sparsity perform demonstrably well in applications where data distributions are fixed. However, they have not yet been substantially explored in the context of RL. We close the gap between RL and single-shot pruning techniques and present a general pruning approach for offline RL. We leverage a fixed dataset to prune neural networks before the start of RL training. We then run experiments varying the network sparsity level and evaluating the validity of pruning-at-initialization techniques in continuous control tasks. Our results show that with 95% of the network weights pruned, offline RL algorithms can still retain performance in the majority of our experiments. To the best of our knowledge, no prior work utilizing pruning in RL retained performance at such high levels of sparsity. Moreover, pruning-at-initialization techniques can be easily integrated into any existing offline RL algorithm without changing the learning objective.
Samin Yeasar Arnob · riyasat.ohib · Sergey Plis · Doina Precup 🔗 


Offline neural contextual bandits: Pessimism, Optimization and Generalization
(Poster)
Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain underdeveloped. In this paper, we consider this problem on the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition for distributional shift than the existing OPL works. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to leverage the benefits of online learning into an offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than an online counterpart. Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
Thanh Nguyen-Tang · Sunil Gupta · A. Tuan Nguyen · Svetha Venkatesh 🔗 


Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions
(Poster)
Reinforcement learning (RL) agents are widely used for solving complex sequential decision-making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they struggle in the offline RL setting, i.e. learning from a static dataset. We show that performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically-motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using generalized value functions. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen.
Bogdan Mazoure · Ilya Kostrikov · Ofir Nachum · Jonathan Tompson 🔗 


Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning
(Poster)
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods, such as a behavior cloning loss, prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
Yi Zhao · Rinu Boney · Alexander Ilin · Juho Kannala · Joni Pajarinen 🔗 


What Would the Expert $do(\cdot)$?: Causal Imitation Learning
(Poster)
We develop algorithms for imitation learning from policy data that was corrupted by unobserved confounders. Sources of such confounding include (a) persistent perturbations to actions or (b) the expert responding to a part of the state that the learner does not have access to. When a confounder affects multiple timesteps of recorded data, it can manifest as spurious correlations between states and actions that a learner might latch on to, leading to poor policy performance. To break up these spurious correlations, we apply modern variants of the classical instrumental variable regression (IVR) technique, enabling us to recover the causally correct underlying policy without requiring access to an interactive expert. In particular, we present two techniques, one of a generative-modeling flavor (DoubIL) that can utilize access to a simulator and one of a game-theoretic flavor (ResiduIL) that can be run entirely offline. We discuss, from the perspective of performance, the types of confounding under which it is better to use an IVR-based technique instead of behavioral cloning and vice versa. We find both of our algorithms compare favorably to behavioral cloning on a simulated rocket landing task.
Gokul Swamy · Sanjiban Choudhury · James Bagnell · Steven Wu 🔗 


Quantile Filtered Imitation Learning
(Poster)
We introduce quantile filtered imitation learning (QFIL), a novel policy improvement operator designed for offline reinforcement learning. QFIL performs policy improvement by running imitation learning on a filtered version of the experience dataset. The filtering process removes (s, a) pairs whose estimated Q values fall below a given quantile of the pushforward distribution over values induced by sampling actions from the behavior policy. The definitions of both the pushforward Q distribution and resulting value function quantile are key contributions of our method. We prove that QFIL gives us a safe policy improvement step with function approximation and that the choice of quantile provides a natural hyperparameter to trade off bias and variance of the improvement step. Empirically, we perform a synthetic experiment illustrating how QFIL effectively makes a bias-variance tradeoff and we see that QFIL performs well on the D4RL benchmark.
David Brandfonbrener · Will Whitney · Rajesh Ranganath · Joan Bruna 🔗 
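
The filtering step described above can be sketched roughly as follows. The callables `q_fn` and `sample_behavior_action`, the nearest-rank quantile, and the default sample count are all assumptions for illustration; in practice both the Q estimate and the behavior policy would be learned models.

```python
import random


def quantile(values, q):
    """Empirical q-quantile of a list of numbers (nearest-rank style)."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def quantile_filter(dataset, q_fn, sample_behavior_action, q=0.5,
                    n_samples=100, seed=0):
    """Keep (state, action) pairs whose Q value reaches the q-quantile of
    Q values obtained by sampling actions from the behavior policy."""
    rng = random.Random(seed)
    kept = []
    for state, action in dataset:
        sampled_q = [
            q_fn(state, sample_behavior_action(state, rng))
            for _ in range(n_samples)
        ]
        threshold = quantile(sampled_q, q)
        if q_fn(state, action) >= threshold:
            kept.append((state, action))
    return kept
```

Imitation learning would then be run on the returned pairs only; a higher `q` keeps fewer, higher-valued pairs.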


Benchmarking Sample Selection Strategies for Batch Reinforcement Learning
(Poster)
Training sample selection techniques, such as prioritized experience replay (PER), are recognized as significantly important for online reinforcement learning algorithms. Efficient sample selection can help further improve the learning efficiency and the final performance. However, the impact of sample selection for batch reinforcement learning (RL) has not been well studied. In this work, we investigate the application of non-uniform sampling techniques in batch RL. In particular, we compare six variants of PER based on various heuristic priority metrics that focus on different aspects of the offline learning setting. These metrics include temporal-difference error, n-step return, self-imitation learning objective, pseudo-count, uncertainty, and likelihood. Through extensive experiments on the standard batch RL datasets, we find that non-uniform sampling is also effective in batch RL settings. Further, there is no single metric that works in all situations. The investigation also shows that it is insufficient to avoid the bootstrapping error in batch reinforcement learning by only changing the sampling scheme.
Yuwei Fu · Di Wu · Benoit Boulet 🔗 
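
As an illustration of non-uniform sampling with one of the listed metrics, the sketch below uses proportional prioritization by absolute temporal-difference error. The exponent `alpha`, the small constant `eps`, and the function names are assumptions; the six variants compared in the paper are not reproduced here.

```python
import random


def priority_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_batch(transitions, td_errors, batch_size, alpha=0.6, seed=0):
    """Draw a batch with replacement according to the priorities."""
    rng = random.Random(seed)
    probs = priority_probabilities(td_errors, alpha)
    return rng.choices(transitions, weights=probs, k=batch_size)
```

Setting `alpha=0` recovers uniform sampling, so the same code covers the baseline.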


Dynamic Mirror Descent based Model Predictive Control for Accelerating Robot Learning
(Poster)
Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best from both: asymptotic performance of Mf-RL and high sample efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for the Mb-trajectory optimization with off-policy methods for the Mf-RL. In particular, two loops are proposed, where the Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm: Mirror-Descent Model Predictive RL (M-DeMoRL), which uses the Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking in a 2R leg, and hardware transfer for robust walking in a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required in the real system, thereby validating the proposed approach.
Utkarsh A Mishra · Soumya Samineni · Aditya Varma Sagi · Shalabh Bhatnagar · Shishir N Y 🔗 
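
For readers unfamiliar with the Cross-Entropy Method with an elite fraction, a minimal one-dimensional sketch is given below. It optimizes a single scalar with a Gaussian sampling distribution; the cost function, population size, and defaults are assumptions, and the paper's inner loop optimizes action sequences under a learned model rather than a scalar.

```python
import random


def cross_entropy_method(cost, mean=0.0, std=2.0, population=100,
                         elite_fraction=0.1, iterations=20, seed=0):
    """Minimize cost(x) over a scalar x by repeatedly refitting a Gaussian
    to the best-scoring (elite) fraction of sampled candidates."""
    rng = random.Random(seed)
    n_elite = max(1, int(population * elite_fraction))
    for _ in range(iterations):
        samples = [rng.gauss(mean, std) for _ in range(population)]
        elites = sorted(samples, key=cost)[:n_elite]
        mean = sum(elites) / n_elite
        variance = sum((x - mean) ** 2 for x in elites) / n_elite
        std = max(variance ** 0.5, 1e-3)  # floor keeps sampling from collapsing
    return mean
```

For example, `cross_entropy_method(lambda x: (x - 3.0) ** 2)` should settle close to 3.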


MBAIL: Multi-Batch Best Action Imitation Learning utilizing Sample Transfer and Policy Distillation
(Poster)
Most online reinforcement learning (RL) algorithms require a large number of interactions with the environment to learn a reliable control policy. Unfortunately, the assumption of the availability of repeated interactions with the environment does not hold for many real-world applications. Batch RL aims to learn a good control policy from a previously collected dataset without requiring additional interactions with the environment, which is very promising for solving real-world problems. However, in the real world, we may only have a limited amount of data points for certain tasks we are interested in. Also, most current batch RL methods aim to learn a policy over one fixed dataset, which makes it hard to learn a policy that performs well over multiple tasks. In this work, we propose to tackle these challenges with sample transfer and policy distillation. The proposed methods are evaluated on multiple control tasks to showcase their effectiveness.
Di Wu · tianyu.li · David Meger · Michael Jenkin · Steve Liu · Gregory Dudek 🔗 


Showing Your Offline Reinforcement Learning Work: Online Evaluation Budget Matters
(Poster)
In recent years, vast progress has been made in Offline Reinforcement Learning (Offline-RL) for various decision-making domains, from finance to robotics. However, comparing and reporting new Offline-RL algorithms has been noted as underdeveloped: (1) use of an unlimited online evaluation budget for hyperparameter search, (2) sidestepping offline policy selection, and (3) ad-hoc performance statistics reporting. In this work, we propose an evaluation technique addressing these issues, Expected Online Performance, that provides a performance estimate for a best-found policy given a fixed online evaluation budget. Using our approach, we can estimate the number of online evaluations required to surpass a given behavioral policy performance. Applying it to several Offline-RL baselines, we find that with a limited online evaluation budget, (1) Behavioral Cloning constitutes a strong baseline over various expert levels and data regimes, and (2) offline uniform policy selection is competitive with value-based approaches. We hope the proposed technique will make it into the toolsets of Offline-RL practitioners to help them arrive at informed conclusions when deploying RL in real-world systems.
Vladislav Kurenkov · Sergey Kolesnikov 🔗 
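
One way to estimate the performance of a best-found policy under a fixed evaluation budget is the expected maximum of `budget` scores drawn uniformly with replacement from the already-evaluated policies. The sketch below implements that closed-form expected maximum; treating it as the paper's exact estimator is an assumption, and the function name is hypothetical.

```python
def expected_max_performance(scores, budget):
    """Expected best score when `budget` policies are drawn uniformly,
    with replacement, from the evaluated `scores` and the best is kept."""
    ordered = sorted(scores)
    n = len(ordered)
    expected = 0.0
    for rank, score in enumerate(ordered, start=1):
        # Probability that the maximum of the draws is the rank-th smallest score.
        prob_is_max = (rank / n) ** budget - ((rank - 1) / n) ** budget
        expected += score * prob_is_max
    return expected
```

Plotting this quantity against `budget` gives a curve that can be compared across algorithms, or against the score of the behavioral policy.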


Offline Reinforcement Learning with Munchausen Regularization
(Poster)
Most temporal-difference-based (TD-based) Reinforcement Learning (RL) methods replace the true value of a transiting state with their current estimate of this value. Munchausen-RL (M-RL) proposes additionally leveraging the current policy to bootstrap RL. The concept of penalizing two consecutive policies that are far from each other is also applicable to offline settings. In our work, we add the Munchausen term in the Q-update step to penalize policies that deviate too far from the previous policy. Our results indicate that this method could be implemented in various offline Q-learning methods to help improve the performance. In addition, we evaluate how prioritized experience replay affects offline RL. Our results show that Munchausen Offline RL outperforms the original methods that lack the regularization term.
Hsin-Yu Liu · Bharathan Balaji · Dezhi Hong 🔗 
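
A rough sketch of a Munchausen-style regression target for a discrete-action Q-function is shown below. It follows the commonly described form: a softmax policy with temperature `tau`, a clipped scaled log-policy bonus weighted by `alpha`, and a soft next-state value. The parameter names and defaults are assumptions, and the offline variants studied in the paper may differ.

```python
import math


def log_softmax(q_values, tau):
    """Log-probabilities of the policy softmax(q / tau), computed stably."""
    top = max(q_values)
    shifted = [(q - top) / tau for q in q_values]
    log_norm = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_norm for s in shifted]


def munchausen_target(reward, action, q_state, q_next, gamma=0.99,
                      alpha=0.9, tau=0.03, clip_min=-1.0, done=False):
    """Target = reward + alpha * clipped(tau * log pi(a|s)) + gamma * soft value."""
    log_pi_state = log_softmax(q_state, tau)
    bonus = alpha * max(tau * log_pi_state[action], clip_min)
    if done:
        return reward + bonus
    log_pi_next = log_softmax(q_next, tau)
    soft_value = sum(
        math.exp(lp) * (q - tau * lp) for lp, q in zip(log_pi_next, q_next)
    )
    return reward + bonus + gamma * soft_value
```

The bonus is never positive, so actions that the current policy finds unlikely receive a lower target; with `alpha=0` the expression reduces to a soft Q-learning target.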


Importance of Empirical Sample Complexity Analysis for Offline Reinforcement Learning
(Poster)
We hypothesize that empirically studying the sample complexity of offline reinforcement learning (RL) is crucial for the practical applications of RL in the real world. Several recent works have demonstrated the ability to learn policies directly from offline data. In this work, we ask how learning from offline data depends on the number of samples. Our objective is to emphasize that studying sample complexity for offline RL is important, and is an indicator of the usefulness of existing offline algorithms. We propose an evaluation approach for sample complexity analysis of offline RL.
Samin Yeasar Arnob · Riashat Islam · Doina Precup 🔗 


Discrete Uncertainty Quantification Approach for Offline RL
(Poster)
In many Reinforcement Learning tasks, the classical online interaction of the learning agent with the environment is impractical, because such interaction is either expensive or dangerous. In these cases, previously gathered data can be used, giving rise to what is typically called Offline Reinforcement Learning. However, this type of learning faces a large number of challenges, mostly derived from the fact that the exploration/exploitation tradeoff is overshadowed. Instead, the historical data is usually biased by the way it was obtained, typically by a suboptimal controller, producing a distributional shift between the historical data and the data required to learn the optimal policy.
Javier Corrochano · Rubén Majadas · Fernando Fernandez 🔗 


Pretraining for Language-Conditioned Imitation with Transformers
(Poster)
We study reinforcement learning (RL) agents which can utilize language inputs and efficiently learn on downstream tasks. To investigate this, we propose a new multimodal benchmark, Text-Conditioned Frostbite, in which an agent must complete tasks specified by text instructions in the Atari Frostbite environment. We curate and release a dataset of 5M text-labelled transitions for training, and to encourage further research in this direction. On this benchmark, we evaluate Text Decision Transformer (TDT), a transformer directly operating on text, state, and action tokens, and find it improves upon baseline architectures. Furthermore, we evaluate the effect of pretraining, finding unsupervised pretraining can yield improved results in low-data settings.
Aaron Putterman · Kevin Lu · Igor Mordatch · Pieter Abbeel 🔗 


Stateful Offline Contextual Policy Evaluation and Learning
(Poster)
We study off-policy evaluation and learning from sequential data in a structured class of Markov decision processes that arise from repeated interactions with an exogenous sequence of arrivals with contexts, which generate unknown individual-level responses to agent actions that induce known transitions. This is a relevant model, for example, for dynamic personalized pricing and other operations management problems in the presence of potentially high-dimensional user types. The individual-level response is not causally affected by the state variable. In this setting, we adapt doubly-robust estimation in the single-timestep setting to the sequential setting so that a state-dependent policy can be learned even from a single timestep's worth of data. We introduce a marginal MDP model and study an algorithm for off-policy learning, which can be viewed as fitted value iteration in the marginal MDP. We also provide structural results on when errors in the response model lead to the persistence, rather than attenuation, of error over time. In simulations, we show that the advantages of doubly-robust estimation in the single-timestep setting, via unbiased and lower-variance estimation, can directly translate to improved out-of-sample policy performance. This structure-specific analysis sheds light on the underlying structure of a class of problems, operations research/management problems, often heralded as a real-world domain for offline RL, which are in fact qualitatively easier.
Angela Zhou 🔗 
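
The single-timestep doubly-robust estimator that the abstract builds on can be sketched as follows. The tuple layout of the logged data and the callables `target_policy` and `reward_model` are assumptions for illustration; the sequential extension and the marginal MDP are not shown.

```python
def doubly_robust_value(logged, target_policy, reward_model, actions):
    """Doubly-robust estimate of a target policy's value from logged bandit data.

    logged: list of (context, action, reward, propensity) tuples, where
        propensity is the logging policy's probability of the logged action.
    target_policy(a, x): probability that the target policy picks a in context x.
    reward_model(x, a): model estimate of the expected reward.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        # Direct-method term: model-based value of the target policy.
        direct = sum(
            target_policy(a, context) * reward_model(context, a) for a in actions
        )
        # Importance-weighted correction using the model's residual.
        weight = target_policy(action, context) / propensity
        total += direct + weight * (reward - reward_model(context, action))
    return total / len(logged)
```

If the reward model is exact, the correction term has mean zero and the estimate equals the direct method; if the model is uninformative but the propensities are correct, it reduces to inverse propensity weighting.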


Offline Reinforcement Learning: Fundamental Barriers for Value Function Approximation
(Poster)
We consider offline reinforcement learning, where the goal is to learn a decision-making policy from logged data. Offline RL, particularly when coupled with (value) function approximation to allow for generalization in large/continuous state spaces, is becoming increasingly relevant in practice, because it avoids costly and time-consuming online data collection and is well-suited to safety-critical domains. Existing sample complexity guarantees for offline value function approximation methods typically require both (1) distributional assumptions (i.e., good coverage) and (2) representational assumptions (i.e., ability to represent some or all Q-value functions) stronger than what is required for supervised learning. However, the necessity of these conditions and the fundamental limits for offline RL are not well-understood in spite of decades of research. This led Chen and Jiang (2019) to conjecture that concentrability (the most standard notion of coverage) and realizability (the weakest representation condition) alone are not sufficient for sample-efficient offline RL. We resolve this conjecture in the positive by proving (information theoretically) that even if both concentrability and realizability are satisfied, any algorithm requires sample complexity polynomial in the size of the state space to learn a non-trivial policy. Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions beyond what is required in classical supervised learning, and highlight a phenomenon called over-coverage which serves as a fundamental barrier for offline value function approximation methods.
Dylan Foster · Akshay Krishnamurthy · David Simchi-Levi · Yunzong Xu 🔗 


Learning Value Functions from Undirected State-only Experience
(Poster)
This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e. (s, s', r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning or LAQ, an offline RL method that can learn effective value functions from state-only experience. Latent Action Q-learning (LAQ) learns value functions using Q-learning on discrete latent actions obtained through a latent-variable future prediction model. We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions. Value functions learned using LAQ lead to sample efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments ranging from 2D grid world to 3D visual navigation in realistic environments demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods.
Matthew Chang · Arjun Gupta · Saurabh Gupta 🔗 


Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations
(Poster)
We study the problem of offline Imitation Learning (IL) where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a static offline dataset of state-action-next state transition triples from both optimal and non-optimal expert behaviors. This strictly offline imitation learning problem arises in many real-world problems, where environment interactions and expert annotations are costly. Prior works that address the problem either require that expert data occupies the majority of the offline dataset, or need to learn a reward function and perform offline reinforcement learning (RL) based on the learned reward function. In this paper, we propose an imitation learning algorithm to address the problem without additional steps of reward learning and offline RL training for the case when demonstrations contain a large proportion of suboptimal data. Building upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert and non-expert data. We propose a cooperation strategy to boost the performance of both tasks, which results in a new policy learning objective. Surprisingly, we find that it is equivalent to a generalized BC objective, where the outputs of the discriminator serve as the weights of the BC loss function. Experimental results show that the proposed algorithm can learn behavior policies that are much closer to the optimal policies than policies learned by baseline algorithms.
Haoran Xu · Xianyuan Zhan · Honglei Yin 🔗 
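
The notion of a generalized BC objective with per-sample weights can be illustrated with a deliberately simplified sketch. Using the raw discriminator output directly as the weight is an assumption made here for illustration; the weighting derived in the paper may take a different form.

```python
def weighted_bc_loss(log_probs, weights):
    """Weighted behavioral cloning loss: negative weighted mean log-likelihood.

    log_probs: log pi(a_i | s_i) of the logged actions under the learned policy.
    weights: per-sample weights, e.g. discriminator outputs in [0, 1]
        (hypothetical choice; higher means "more expert-like").
    """
    if len(log_probs) != len(weights):
        raise ValueError("log_probs and weights must have the same length")
    weighted = sum(w * lp for w, lp in zip(weights, log_probs))
    return -weighted / len(log_probs)
```

With all weights equal to 1 this is plain behavioral cloning; weights near 0 effectively remove samples the discriminator considers non-expert.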


Model-Based Offline Planning with Trajectory Pruning
(Poster)
Offline reinforcement learning (RL) enables learning policies using pre-collected datasets without environment interaction, which provides a promising direction to make RL usable in real-world systems. Although recent offline RL studies have achieved much progress, existing methods still face many practical challenges in real-world system control tasks, such as computational restriction during agent training and the requirement of extra control flexibility. The model-based planning framework provides an attractive solution for such tasks. However, most model-based planning algorithms are not designed for offline settings. Simply combining the ingredients of offline RL with existing methods either provides over-restrictive planning or leads to inferior performance. We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning. MOPP encourages more aggressive trajectory rollout guided by the behavior policy learned from data, and prunes out problematic trajectories to avoid potential out-of-distribution samples. Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches.
Xianyuan Zhan · Xiangyu Zhu · Haoran Xu 🔗 


TRAIL: Near-Optimal Imitation Learning with Suboptimal Data
(Poster)
The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. Is it possible to formalize these conceptual benefits and devise algorithms to use offline datasets to yield provable improvements to the sample-efficiency of imitation learning? In this work, we study this question and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample-efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively, and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL is able to recover near-optimal policies with fewer expert trajectories.
Sherry Yang · Sergey Levine · Ofir Nachum 🔗 


Offline Meta-Reinforcement Learning for Industrial Insertion
(Poster)
Reinforcement learning (RL) can in principle make it possible for robots to automatically adapt to new tasks, but in practice current RL methods require a very large number of trials to accomplish this. In this paper, we tackle rapid adaptation to new tasks through the framework of meta-learning, which utilizes past tasks to learn to adapt, with a specific focus on industrial insertion tasks. We address two specific challenges by applying meta-learning in this setting. First, conventional meta-RL algorithms require lengthy online meta-training phases. We show that this can be replaced with appropriately chosen offline data, resulting in an offline meta-RL method that only requires demonstrations and trials from each of the prior tasks, without the need to run costly meta-RL procedures online. Second, meta-RL methods can fail to generalize to new tasks that are too different from those seen at meta-training time, which poses a particular challenge in industrial applications, where high success rates are critical. We address this by combining contextual meta-learning with direct online fine-tuning: if the new task is similar to those seen in the prior data, then the contextual meta-learner adapts immediately, and if it is too different, it gradually adapts through fine-tuning. We show that our approach is able to quickly adapt to a variety of different insertion tasks, learning how to perform them with a success rate of 100% using only a fraction of the samples needed for learning the tasks from scratch. Experiment videos and details are available at https://sites.google.com/view/odaanon.
Tony Zhao · Jianlan Luo · Oleg Sushkov · Rugile Pevceviciute · Nicolas Heess · Jonathan Scholz · Stefan Schaal · Sergey Levine 🔗 


Sim-to-Real Interactive Recommendation via Off-Dynamics Reinforcement Learning
(Poster)
Interactive recommender systems (IRS) have received growing attention due to their awareness of long-term engagement and dynamic preferences. Although the long-term planning perspective of reinforcement learning (RL) naturally fits the IRS setup, RL methods require a large amount of online user interaction, which is restricted due to economic considerations. To train agents with limited interaction data, previous works often rely on building simulators to mimic user behaviors in real systems. This poses potential challenges to the success of sim-to-real transfer. In practice, such transfer easily fails because user dynamics are highly unpredictable and sensitive to the type of recommendation task. To address this issue, we propose a novel method, S2RRec, to bridge the sim-to-real gap via off-dynamics RL. In general, we expect a policy learned by interacting only with the simulator to perform well in the real environment. To achieve this, we conduct dynamics adaptation to calibrate the difference in state transitions using reward correction. Furthermore, we align the representation discrepancy of items through representation adaptation. Instead of separating the above into two stages, we propose to jointly adapt the dynamics and representations, leading to a unified learning objective. Experiments on real-world datasets validate the superiority of our approach, which achieves an improvement of about 33.18% over the baselines.
Junda Wu · Zhihui Xie · Tong Yu · Qizhi Li · Shuai Li


Why so pessimistic? Estimating uncertainties for offline RL through ensembles, and why their independence matters
(Poster)
In offline/batch reinforcement learning (RL), the predominant class of approaches with the most success has been "support constraint" methods, where trained policies are encouraged to remain within the support of the provided offline dataset. However, support constraints correspond to an overly pessimistic assumption that actions outside the provided data may lead to worst-case outcomes. In this work, we aim to relax this assumption by obtaining uncertainty estimates for predicted action values, and acting conservatively with respect to a lower-confidence bound (LCB) on these estimates. Motivated by the success of ensembles for uncertainty estimation in supervised learning, we propose MSG, an offline RL method that employs an ensemble of independently updated Q-functions. First, theoretically, by referring to the literature on infinite-width neural networks, we demonstrate the crucial dependence of the quality of derived uncertainties on the manner in which ensembling is performed, a phenomenon that arises due to the dynamic programming nature of RL and is overlooked by existing offline RL methods. Our theoretical predictions are corroborated by pedagogical examples on toy MDPs, as well as empirical comparisons in benchmark continuous control domains. In the significantly more challenging antmaze domains of the D4RL benchmark, MSG with deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide margin. Consequently, we investigate whether efficient approximations can be similarly effective. We demonstrate that while some very efficient variants also outperform the current state of the art, they do not match the performance and robustness of MSG with deep ensembles. We hope that the significant impact of our less pessimistic approach engenders increased focus on uncertainty estimation techniques directed at RL, and engenders new efforts from the community of deep network uncertainty estimation researchers.
Seyed Kamyar Seyed Ghasemipour · Shixiang (Shane) Gu · Ofir Nachum
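
As a rough illustration of acting conservatively with respect to a lower-confidence bound, the sketch below scores each action by the ensemble mean of its Q-estimates minus a multiple of their standard deviation, then picks the highest-scoring action. This is not the MSG algorithm itself: the mean-minus-beta-times-std form of the bound, the function names, the `beta` parameter, and the toy numbers are all assumptions made for the example, and no training of the Q-functions is shown.

```python
from statistics import mean, pstdev


def lower_confidence_bound(q_values, beta=1.0):
    """Assumed LCB form: ensemble mean minus beta times ensemble std."""
    return mean(q_values) - beta * pstdev(q_values)


def select_action(ensemble_q, beta=1.0):
    """Return the action whose LCB is largest.

    ensemble_q maps each action to a list of Q-estimates,
    one estimate per ensemble member (hypothetical data layout).
    """
    return max(
        ensemble_q,
        key=lambda action: lower_confidence_bound(ensemble_q[action], beta),
    )


# Toy values: both actions have the same mean estimate (1.0),
# but the ensemble members disagree strongly about "a1".
ensemble_q = {
    "a0": [1.0, 1.1, 0.9, 1.0],
    "a1": [3.0, -1.0, 2.5, -0.5],
}

chosen = select_action(ensemble_q, beta=1.0)
print(chosen)
```

With `beta=1.0` the disagreement about `a1` lowers its bound below that of `a0`, so the more certain action is preferred; a larger `beta` makes the choice more pessimistic.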


Example-Based Offline Reinforcement Learning without Rewards
(Poster)
Offline reinforcement learning (RL) methods, which tackle the problem of learning a policy from a static dataset, have shown promise in deploying RL in real-world scenarios. Offline RL allows the reuse and accumulation of large datasets while mitigating safety concerns that arise in online exploration. However, prior works require human-defined reward labels to learn from offline datasets. Reward specification remains a major challenge for deep RL algorithms and also poses an issue for offline RL in the real world, since designing reward functions could take considerable manual effort and could also require installing extra hardware, such as visual sensors on robots, to detect the completion of a task. In contrast, in many settings, it is easier for users to provide examples of a completed task, such as images, than to specify a complex reward function. Based on this observation, we propose an algorithm that can learn behaviors from offline datasets without reward labels, instead using a small number of example images. Our method learns a conservative classifier that directly learns a Q-function from the offline dataset and the successful examples while penalizing the Q-values to prevent distributional shift. Through extensive empirical results, we find that our method outperforms prior imitation learning algorithms and inverse RL methods that directly learn rewards by 53% in vision-based robot manipulation domains.
Kyle Hatch · Tianhe Yu · Rafael Rafailov · Chelsea Finn


The Reflective Explorer: Online Meta-Exploration from Offline Data in Realistic Robotic Tasks
(Poster)
Reinforcement learning is difficult to apply to real-world problems due to high sample complexity, the need to adapt to frequent distribution shifts, and the complexities of learning from high-dimensional inputs, such as images. Over the last several years, meta-learning has emerged as a promising approach to tackle these problems by explicitly training an agent to quickly adapt to new tasks. However, such methods still require huge amounts of data during training and are difficult to optimize in high-dimensional domains. One potential solution is to consider offline or batch meta-reinforcement learning (RL): learning from existing datasets without additional environment interactions during training. In this work we develop the first offline model-based meta-RL algorithm that operates from images in tasks with sparse rewards. Our approach has three main components: a novel strategy to construct meta-exploration trajectories from offline data, which allows agents to learn meaningful meta-test time task inference strategies; representation learning via variational filtering; and latent conservative model-free policy optimization. We show that our method completely solves a realistic meta-learning task involving robot manipulation, while naive combinations of previous approaches fail.
Rafael Rafailov · · Tianhe Yu · Avi Singh · Mariano Phielipp · Chelsea Finn