
Workshop
Manfred Díaz · Hiroki Furuta · Elise van der Pol · Lisa Lee · Shixiang (Shane) Gu · Pablo Samuel Castro · Simon Du · Marc Bellemare · Sergey Levine

Tue Dec 14 05:00 AM -- 02:20 PM (PST)

This workshop builds connections between different areas of RL centered around the understanding of algorithms and their context. We are interested in questions including, but not limited to: (i) How can we gauge the complexity of an RL problem?, (ii) Which classes of algorithms can tackle which classes of problems?, and (iii) How can we develop practically applicable guidelines for formulating RL tasks that are tractable to solve? We expect submissions that address these and other related questions through an ecological and data-centric view, pushing the limits of our understanding of the RL problem.

Tue 5:00 a.m. - 5:10 a.m.  Introductory Remarks (Intro)

Tue 5:10 a.m. - 5:30 a.m.  Artificial what? (Invited Talk) Shane Legg

Tue 5:30 a.m. - 5:40 a.m.  Shane Legg (Live Q&A)

Tue 5:40 a.m. - 6:00 a.m.  What makes for an interesting RL problem? (Invited Talk) Joelle Pineau

Tue 6:00 a.m. - 6:10 a.m.  Joelle Pineau (Live Q&A)

Tue 6:10 a.m. - 6:25 a.m.  HyperDQN: A Randomized Exploration Method for Deep Reinforcement Learning (Oral)

Randomized least-square value iteration (RLSVI) is a provably efficient exploration method. However, it is limited to the case where (1) a good feature is known in advance and (2) this feature is fixed during training; otherwise, RLSVI incurs a prohibitive computational burden to obtain posterior samples of the parameter of the $Q$-value function. In this work, we present a practical algorithm named HyperDQN that addresses these two issues in the context of deep reinforcement learning, where the feature changes over iterations. HyperDQN is built on two parametric models: in addition to a non-linear neural network (i.e., the base model) that predicts $Q$-values, our method employs a probabilistic hypermodel (i.e., the meta model), which outputs the parameters of the base model. When both models are jointly optimized under a specifically designed objective, three purposes are achieved. First, the hypermodel can generate approximate posterior samples of the parameter of the $Q$-value function. As a result, diverse $Q$-value functions are sampled to select exploratory action sequences. This retains the key mechanism behind RLSVI's efficient exploration. Second, a good feature is learned to approximate $Q$-value functions, which addresses limitation (1). Third, posterior samples of the $Q$-value function can be obtained more efficiently than with existing methods, and the changing feature does not affect this efficiency, which deals with limitation (2). On the Atari 2600 suite, after $20$M samples, HyperDQN achieves about $2\times$ improvements over (double) DQN, the advanced method Bootstrapped DQN, and the state-of-the-art exploration-bonus method OB2I. On the challenging SuperMarioBros task, HyperDQN outperforms baselines on $7$ out of $9$ games.

Ziniu Li · Yingru Li · Yushun Zhang · Tong Zhang · Zhiquan Luo

Tue 6:25 a.m. - 6:40 a.m.  Grounding an Ecological Theory of Artificial Intelligence in Human Evolution (Oral)

Recent advances in Artificial Intelligence (AI) have revived the quest for agents able to acquire an open-ended repertoire of skills. Although this ability is fundamentally related to the characteristics of human intelligence, research in this field rarely considers the processes and ecological conditions that may have guided the emergence of complex cognitive capacities during the evolution of the species. Research in Human Behavioral Ecology (HBE) seeks to understand how the behaviors characterizing human nature can be conceived as adaptive responses to major changes in our ecological niche. In this paper, we propose a framework highlighting the role of environmental complexity in open-ended skill acquisition, grounded in major hypotheses from HBE and recent contributions in Reinforcement Learning (RL). We use this framework to highlight fundamental links between the two disciplines, as well as to identify feedback loops that bootstrap ecological complexity and create promising research directions for AI researchers. We also present our first steps towards designing a simulation environment that implements the climate dynamics necessary for studying key HBE hypotheses relating environmental complexity to skill acquisition.

Eleni Nisioti · Clément Moulin-Frier

Tue 6:40 a.m. - 6:50 a.m.  Virtual Coffee Break (Break)

Come and join us in the virtual lounge on GatherTown for a short break.

Tue 6:50 a.m. - 7:10 a.m.
Sculpting (human-like) AI systems by sculpting their (social) environments (Invited Talk) Pierre-Yves Oudeyer

Tue 7:10 a.m. - 7:20 a.m.  Pierre-Yves Oudeyer (Live Q&A)

Tue 7:20 a.m. - 7:40 a.m.  Towards RL applications in video games and with human users (Invited Talk) Katja Hofmann

Tue 7:40 a.m. - 7:50 a.m.  Katja Hofmann (Live Q&A)

Tue 7:50 a.m. - 8:05 a.m.  Habitat 2.0: Training Home Assistants to Rearrange their Habitat (Oral)

We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack – data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850× real-time) on an 8-GPU node, representing 100× speed-ups over prior work; and (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from ‘hand-off problems’; and (3) SPA pipelines are more brittle than RL policies.

Andrew Szot · Alexander Clegg · Eric Undersander · Erik Wijmans · Yili Zhao · Noah Maestre · Mustafa Mukadam · Oleksandr Maksymets · Aaron Gokaslan · Sameer Dharur · Franziska Meier · Wojciech Galuba · Angel Chang · Zsolt Kira · Vladlen Koltun · Jitendra Malik · Manolis Savva · Dhruv Batra

Tue 8:05 a.m. - 8:20 a.m.  Embodied Intelligence via Learning and Evolution (Contributed Talk) Agrim Gupta

Tue 8:20 a.m. - 8:40 a.m.  A Methodology for RL Environment Research (Invited Talk) Daniel Tanis

Tue 8:40 a.m. - 8:50 a.m.  Daniel Tanis (Live Q&A)

Tue 8:50 a.m. - 9:00 a.m.  Virtual Coffee Break (Break)

Tue 9:00 a.m. - 10:00 a.m.  Virtual Poster Session (Poster Session)

Tue 10:00 a.m. - 10:20 a.m.  Environment Capacity (Invited Talk) Benjamin Van Roy

Tue 10:20 a.m. - 10:30 a.m.  Benjamin Van Roy (Live Q&A)

Tue 10:30 a.m. - 10:50 a.m.  A Universal Framework for Reinforcement Learning (Invited Talk) Warren Powell

Tue 10:50 a.m. - 11:00 a.m.  Warren Powell (Live Q&A)

Tue 11:00 a.m. - 11:15 a.m.  Representation Learning for Online and Offline RL in Low-rank MDPs (Oral)

This work studies the question of representation learning in RL: how can we learn a compact low-dimensional representation such that, on top of the representation, we can perform RL procedures such as exploration and exploitation in a sample-efficient manner. We focus on low-rank Markov Decision Processes (MDPs), where the transition dynamics correspond to a low-rank transition matrix. Unlike prior works that assume the representation is known (e.g., linear MDPs), here we need to learn the representation for the low-rank MDP. We study both the online RL and offline RL settings.
For the online setting, operating with the same computational oracles used in FLAMBE (Agarwal et al.), the state-of-the-art algorithm for learning representations in low-rank MDPs, we propose an algorithm REP-UCB (Upper Confidence Bound driven Representation learning for RL), which significantly improves the sample complexity from $\widetilde{O}( A^9 d^7 / (\epsilon^{10} (1-\gamma)^{22}))$ for FLAMBE to $\widetilde{O}( A^4 d^4 / (\epsilon^2 (1-\gamma)^{2}) )$, with $d$ being the rank of the transition matrix (or the dimension of the ground-truth representation), $A$ the number of actions, and $\gamma$ the discount factor. Notably, REP-UCB is simpler than FLAMBE, as it directly balances the interplay between representation learning, exploration, and exploitation, while FLAMBE is an explore-then-commit style approach that has to perform reward-free exploration step-by-step forward in time. For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition: our algorithm is able to compete against any policy as long as it is covered by the offline distribution.

Masatoshi Uehara · Xuezhou Zhang · Wen Sun

Tue 11:15 a.m. - 11:30 a.m.  Understanding the Effects of Dataset Composition on Offline Reinforcement Learning (Oral)

The promise of Offline Reinforcement Learning (RL) lies in learning policies from fixed datasets, without interacting with the environment. Being unable to interact makes the dataset the most essential ingredient of the algorithm, as it directly affects the learned policies. Studies of how the dataset composition influences different Offline RL algorithms have been missing. To that end, we conducted a comprehensive empirical analysis of the effect of dataset composition on the performance of Offline RL algorithms for discrete-action environments. Performance is studied through two dataset metrics: Trajectory Quality (TQ) and State-Action Coverage (SACo). Our analysis suggests that variants of the off-policy Deep Q-Network family rely on the dataset to exhibit high SACo. In contrast, algorithms that constrain the learned policy towards the data-generating policy perform well across datasets if those exhibit high TQ, SACo, or both. For datasets with high TQ, Behavior Cloning outperforms or performs similarly to the best Offline RL algorithms.

Kajetan Schweighofer · Markus Hofmarcher · Marius-Constantin Dinu · Angela Bitto · Philipp Renz · Vihang Patil · Sepp Hochreiter

Tue 11:30 a.m. - 11:50 a.m.  Structural Assumptions for Better Generalization in Reinforcement Learning (Invited Talk) Amy Zhang

Tue 11:50 a.m. - 12:00 p.m.  Amy Zhang (Live Q&A)

Tue 12:00 p.m. - 12:10 p.m.  Virtual Coffee Break (Break)

Tue 12:10 p.m. - 12:30 p.m.  Reinforcement learning: It's all in the mind (Invited Talk) Tom Griffiths

Tue 12:30 p.m. - 12:40 p.m.  Tom Griffiths (Live Q&A)

Tue 12:40 p.m. - 1:00 p.m.  Curriculum-based Learning: An Effective Approach for Acquiring Dynamic Skills (Invited Talk) Michiel van de Panne

Tue 1:00 p.m. - 1:10 p.m.  Michiel van de Panne (Live Q&A)

Tue 1:10 p.m. - 2:00 p.m.  Live Panel Discussion (Discussion Panel)

Tue 2:00 p.m. - 2:15 p.m.  BIG-Gym: A Crowd-Sourcing Challenge for RL Environments and Behaviors (Launch)

Tue 2:15 p.m. - 2:20 p.m.  Closing Remarks (Remarks)
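For readers unfamiliar with hypermodel-based exploration, as described in the HyperDQN oral abstract above, the core sampling step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a linear hypermodel and linear state features instead of neural networks, and all dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: state features, actions, and the random index z.
d_state, n_actions, d_z = 4, 3, 8

# Hypermodel (meta model): maps a random index z to the parameters of the
# base Q-model. In HyperDQN both models are neural networks trained jointly;
# this linear stand-in only illustrates the posterior-sampling mechanism.
hyper_W = rng.normal(size=(d_state * n_actions, d_z)) * 0.1
hyper_b = rng.normal(size=(d_state * n_actions,))

def sample_q_params():
    """Draw one approximate posterior sample of the Q-function parameters."""
    z = rng.normal(size=d_z)          # random index z ~ N(0, I)
    theta = hyper_W @ z + hyper_b     # hypermodel output = base-model params
    return theta.reshape(n_actions, d_state)

def act(state):
    """RLSVI/Thompson-style exploration: sample a Q-function, act greedily."""
    q_weights = sample_q_params()
    q_values = q_weights @ state      # Q(s, a) for each action a
    return int(np.argmax(q_values))

# Repeated sampling yields diverse Q-functions, hence exploratory actions.
state = np.ones(d_state)
actions = [act(state) for _ in range(100)]
print(sorted(set(actions)))
```

Because a fresh Q-function is sampled per decision, exploration is driven by posterior uncertainty rather than by an explicit bonus, which is the property HyperDQN retains from RLSVI.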