How to extract as much learning signal as possible from each trajectory has been a key problem in reinforcement learning (RL), where sample inefficiency poses serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay (HER) or returns-to-go in Decision Transformer (DT) -- enables efficient learning of context-conditioned policies, where at times online RL can be fully replaced by offline behavioral cloning (BC), e.g. sequence modeling. Inspired by the distributional and state-marginal-matching literature in RL, we demonstrate that all these approaches are essentially doing hindsight information matching (HIM) -- training policies that can output the rest of a trajectory so that it matches given future state-information statistics. We first present Distributional Decision Transformer (DDT) and its practical instantiation, Categorical DT, and show that this simple modification to DT enables effective offline state-marginal matching that generalizes well to unseen, even synthetic multi-modal, reward or state-feature distributions. We perform experiments on Gym's MuJoCo continuous-control benchmarks and empirically validate performance. Additionally, we present and test another simple modification to DT called Unsupervised DT (UDT), show its connection to distribution matching, inverse RL and representation learning, and empirically demonstrate its effectiveness for offline imitation learning. To the best of our knowledge, DDT and UDT together constitute the first successes for offline state-marginal matching and inverse-RL imitation learning, allowing us to propose the first benchmarks for these two important subfields and greatly expand the role of powerful sequence modeling architectures in modern RL.
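To make the hindsight-information-matching recipe more concrete, below is a minimal, hypothetical relabeling sketch (not the authors' code): it summarizes the future portion of each trajectory as a normalized histogram over one state feature, the kind of categorical target a Categorical-DT-style policy could then be conditioned on during offline behavioral cloning. The function and parameter names (`hindsight_categorical_info`, `relabel_trajectory`, the 8-bin range) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hindsight_categorical_info(states, feature_fn, bins, value_range):
    """Summarize a set of future states as a categorical (histogram) distribution
    over one chosen state feature -- a simple hindsight information statistic."""
    feats = np.array([feature_fn(s) for s in states])          # one scalar per state
    hist, _ = np.histogram(feats, bins=bins, range=value_range)
    return hist / max(hist.sum(), 1)                            # normalized probabilities

def relabel_trajectory(trajectory, feature_fn, bins=8, value_range=(-1.0, 1.0)):
    """Relabel each timestep with the statistics of the *remaining* trajectory,
    so a conditioned policy can be trained by plain behavioral cloning."""
    relabeled = []
    for t, (state, action) in enumerate(trajectory):
        future_states = [s for s, _ in trajectory[t:]]
        z = hindsight_categorical_info(future_states, feature_fn, bins, value_range)
        relabeled.append((z, state, action))                    # (target info, state, action)
    return relabeled

# Toy usage: condition on the distribution of the first state dimension.
rng = np.random.default_rng(0)
traj = [(rng.uniform(-1, 1, size=4), rng.uniform(-1, 1, size=2)) for _ in range(50)]
data = relabel_trajectory(traj, feature_fn=lambda s: s[0])
print(data[0][0])  # categorical target the policy would be conditioned on at t=0
```

At test time, the same conditioning slot could instead be filled with an arbitrary target distribution (e.g. a synthetic multi-modal histogram), which is what allows matching unseen reward or state-feature distributions in this framing.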
Author Information
Hiroki Furuta (The University of Tokyo)
Yutaka Matsuo (University of Tokyo)
Shixiang (Shane) Gu (Google Brain, University of Cambridge)
More from the Same Authors
- 2021 Spotlight: Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization (Yusuke Iwasawa · Yutaka Matsuo)
- 2021 Spotlight: A Minimalist Approach to Offline Reinforcement Learning (Scott Fujimoto · Shixiang (Shane) Gu)
- 2022: Control Graph as Unified IO for Morphology-Task Generalization (Hiroki Furuta · Yusuke Iwasawa · Yutaka Matsuo · Shixiang (Shane) Gu)
- 2023 Poster: DreamSparse: Escaping from Plato’s Cave with 2D Diffusion Model Given Sparse Views (Paul Yoo · Jiaxian Guo · Yutaka Matsuo · Shixiang (Shane) Gu)
- 2022 Poster: Large Language Models are Zero-Shot Reasoners (Takeshi Kojima · Shixiang (Shane) Gu · Machel Reid · Yutaka Matsuo · Yusuke Iwasawa)
- 2022 Poster: Langevin Autoencoders for Learning Deep Latent Variable Models (Shohei Taniguchi · Yusuke Iwasawa · Wataru Kumagai · Yutaka Matsuo)
- 2021 Workshop: Ecological Theory of Reinforcement Learning: How Does Task Design Influence Agent Learning? (Manfred Díaz · Hiroki Furuta · Elise van der Pol · Lisa Lee · Shixiang (Shane) Gu · Pablo Samuel Castro · Simon Du · Marc Bellemare · Sergey Levine)
- 2021 Poster: A Minimalist Approach to Offline Reinforcement Learning (Scott Fujimoto · Shixiang (Shane) Gu)
- 2021 Poster: Co-Adaptation of Algorithmic and Implementational Innovations in Inference-based Deep Reinforcement Learning (Hiroki Furuta · Tadashi Kozuno · Tatsuya Matsushima · Yutaka Matsuo · Shixiang (Shane) Gu)
- 2021 Poster: Test-Time Classifier Adjustment Module for Model-Agnostic Domain Generalization (Yusuke Iwasawa · Yutaka Matsuo)
- 2018 Poster: Data-Efficient Hierarchical Reinforcement Learning (Ofir Nachum · Shixiang (Shane) Gu · Honglak Lee · Sergey Levine)
- 2017 Poster: Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning (Shixiang (Shane) Gu · Timothy Lillicrap · Richard Turner · Zoubin Ghahramani · Bernhard Schölkopf · Sergey Levine)
- 2015 Poster: Particle Gibbs for Infinite Hidden Markov Models (Nilesh Tripuraneni · Shixiang (Shane) Gu · Hong Ge · Zoubin Ghahramani)
- 2015 Poster: Neural Adaptive Sequential Monte Carlo (Shixiang (Shane) Gu · Zoubin Ghahramani · Richard Turner)