LAPO: Latent-Variable Advantage-Weighted Policy Optimization for Offline Reinforcement Learning
Xi Chen · Ali Ghadirzadeh · Tianhe Yu · Jianhao Wang · Alex Yuan Gao · Wenzhe Li · Liang Bin · Chelsea Finn · Chongjie Zhang


Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new samples. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets often contain action distributions with multiple modes and, in some cases, lack a sufficient number of high-reward trajectories, which renders offline policy training inefficient. To address this challenge, we propose to leverage a latent-variable generative model to represent high-advantage state-action pairs, leading to better adherence to the data distributions that contribute to solving the task, while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.

Author Information

Xi Chen (Tsinghua University)
Ali Ghadirzadeh (Stanford University)
Tianhe Yu (Stanford University)
Jianhao Wang (Tsinghua University)
Alex Yuan Gao (The Curious AI Company)
Wenzhe Li (Tsinghua University)
Liang Bin
Chelsea Finn (Stanford University)
Chongjie Zhang (Tsinghua University)
