Timezone: »

Replay-Guided Adversarial Environment Design
Minqi Jiang · Michael Dennis · Jack Parker-Holder · Jakob Foerster · Edward Grefenstette · Tim Rocktäschel

Thu Dec 09 08:30 AM -- 10:00 AM (PST) @ None #None
Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent's capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly-generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (training on less data), we can improve the convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR$^{\perp}$, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that PLR$^{\perp}$ improves the performance of PAIRED, from which it inherited its theoretical framework.

Author Information

Minqi Jiang (UCL & FAIR)
Michael Dennis (University of California Berkeley)

Michael Dennis is a 5th year grad student at the Center for Human-Compatible AI. With a background in theoretical computer science, he is working to close the gap between decision theoretic and game theoretic recommendations and the current state of the art approaches to robust RL and multi-agent RL. The overall aim of this work is to ensure that our systems behave in a way that is robustly beneficial. In the single agent setting, this means making decisions and managing risk in the way the designer intends. In the multi-agent setting, this means ensuring that the concerns of the designer and those of others in the society are fairly and justly negotiated to the benefit of all involved.

Jack Parker-Holder (University of Oxford)
Jakob Foerster (University of Oxford)

Jakob Foerster received a CIFAR AI chair in 2019 and is starting as an Assistant Professor at the University of Toronto and the Vector Institute in the academic year 20/21. During his PhD at the University of Oxford, he helped bring deep multi-agent reinforcement learning to the forefront of AI research and interned at Google Brain, OpenAI, and DeepMind. He has since been working as a research scientist at Facebook AI Research in California, where he will continue advancing the field up to his move to Toronto. He was the lead organizer of the first Emergent Communication (EmeCom) workshop at NeurIPS in 2017, which he has helped organize ever since.

Edward Grefenstette (Facebook AI Research & University College London)
Tim Rocktäschel (Facebook AI Research)

More from the Same Authors