

San Diego Oral Session

Oral 3A Reinforcement/State-space 2

Exhibit Hall F, G, H

Moderators: Reinhard Heckel · Shangtong Zhang

Thu 4 Dec 10 a.m. PST — 11 a.m. PST

Thu 4 Dec. 10:00 - 10:20 PST

State Entropy Regularization for Robust Reinforcement Learning

Yonatan Ashlag · Uri Koren · Mirco Mutti · Esther Derman · Pierre-Luc Bacon · Shie Mannor

State entropy regularization has been shown empirically to improve exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that, compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
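
For intuition only, a state-entropy-regularized objective can be pictured as the usual return plus an entropy bonus over the states the policy visits. The snippet below is a minimal sketch assuming a non-parametric k-nearest-neighbour entropy estimate and an illustrative weight tau; the function names and estimator choice are assumptions for this sketch, not the authors' exact formulation.

import numpy as np

def knn_state_entropy(states: np.ndarray, k: int = 5) -> float:
    # Non-parametric k-NN estimate of the entropy of the visited-state
    # distribution (up to additive constants). states has shape (N, d).
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Distance from each state to its k-th nearest neighbour (self excluded).
    kth = np.sort(dists, axis=1)[:, k]
    return float(np.mean(np.log(kth + 1e-8)))

def state_entropy_regularized_return(rewards: np.ndarray,
                                     states: np.ndarray,
                                     tau: float = 0.1) -> float:
    # Plain episodic return plus a state-entropy bonus weighted by tau.
    return float(np.sum(rewards)) + tau * knn_state_entropy(states)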

Thu 4 Dec. 10:20 - 10:40 PST

A Clean Slate for Offline Reinforcement Learning

Matthew T Jackson · Uljad Berdica · Jarek Liesen · Shimon Whiteson · Jakob Foerster

Progress in offline reinforcement learning (RL) has been impeded by ambiguous problem definitions and entangled algorithmic designs, resulting in inconsistent implementations, insufficient ablations, and unfair evaluations. Although offline RL explicitly avoids environment interaction, prior methods frequently employ extensive, undocumented online evaluation for hyperparameter tuning, complicating method comparisons. Moreover, existing reference implementations differ significantly in boilerplate code, obscuring their core algorithmic contributions. We address these challenges by first introducing a rigorous taxonomy and a transparent evaluation protocol that explicitly quantifies online tuning budgets. To resolve opaque algorithmic design, we provide clean, minimalistic, single-file implementations of various model-free and model-based offline RL methods, significantly enhancing clarity and achieving substantial speed-ups. Leveraging these streamlined implementations, we propose Unifloral, a unified algorithm that encapsulates diverse prior approaches and enables development within a single, comprehensive hyperparameter space. Using Unifloral with our rigorous evaluation protocol, we develop two novel algorithms, TD3-AWR (model-free) and MoBRAC (model-based), which substantially outperform established baselines. Our implementation is publicly available at https://github.com/EmptyJackson/unifloral.
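
The "single, comprehensive hyperparameter space" can be pictured as one configuration object whose settings recover different prior algorithms. The sketch below is a hypothetical illustration of that idea; the field names and defaults are assumptions, not the actual Unifloral interface.

from dataclasses import dataclass

@dataclass
class OfflineRLConfig:
    # Hypothetical design axes; prior methods become points in this space.
    actor_loss: str = "awr"           # advantage-weighted ("awr") or "td3_bc"
    bc_weight: float = 2.5            # strength of the behaviour-cloning term
    num_critics: int = 2              # clipped double-Q by default
    use_dynamics_model: bool = False  # model-based vs. model-free branch
    online_tuning_budget: int = 0     # online episodes spent on tuning

# Example: a TD3+BC-style configuration versus an AWR-style one.
td3_bc_like = OfflineRLConfig(actor_loss="td3_bc", bc_weight=2.5)
awr_like = OfflineRLConfig(actor_loss="awr", bc_weight=0.0)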

Thu 4 Dec. 10:40 - 11:00 PST

Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies

Felix Chalumeau · Daniel Rajaonarivonivelomanantsoa · Ruan John de Kock · Juan Formanek · Sasha Abramowitz · Omayma Mahjoub · Wiem Khlifi · Simon Du Toit · Louay Nessir · Refiloe Shabe · Arnol Fokam · Siddarth Singh · Ulrich Armel Mbou Sob · Arnu Pretorius

Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple of seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making this the largest study on inference strategies for complex RL to date. We make all of our experimental data and code available.
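
One simple instance of an execution-time inference strategy is best-of-N sampling: spend the available budget on several stochastic rollouts of the trained policy and return the best attempt. The sketch below is a generic illustration under that assumption, not the paper's specific strategies; env_reset, env_step, and policy are assumed callables.

import numpy as np

def best_of_n_inference(env_reset, env_step, policy,
                        n_attempts: int = 16, horizon: int = 100,
                        seed: int = 0):
    # Sample n_attempts stochastic rollouts and keep the highest-return one.
    rng = np.random.default_rng(seed)
    best_return, best_actions = -np.inf, None
    for _ in range(n_attempts):
        obs, total, actions = env_reset(), 0.0, []
        for _ in range(horizon):
            action = policy(obs, rng)            # sample a stochastic action
            obs, reward, done = env_step(action)
            total += reward
            actions.append(action)
            if done:
                break
        if total > best_return:
            best_return, best_actions = total, actions
    return best_return, best_actions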