

Poster

Worst-Case Offline Reinforcement Learning with Arbitrary Data Support

Kohei Miyaguchi

West Ballroom A-D #6307
Wed 11 Dec 11 a.m. PST — 2 p.m. PST

Abstract: We propose an offline reinforcement learning (RL) method with a performance guarantee that requires no assumptions on the data support. Under such conditions, estimating or optimizing the conventional performance metric is generally infeasible due to the distributional discrepancy between the data and target policy distributions. To address this issue, we adopt the worst-case policy value as a new metric and constructively show that a sample complexity bound of $O(\epsilon^{-2})$ is attainable without any data-support conditions, where $\epsilon>0$ is the policy suboptimality measured in the new metric. Moreover, since the new metric generalizes the conventional one, the algorithm handles standard offline RL tasks without modification. In this context, our sample complexity bound can be seen as a strict improvement on previous bounds obtained under single-policy concentrability and single-policy realizability.
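The abstract does not spell out the worst-case policy value or the algorithm itself, so the following is only an illustrative sketch of the general idea, not the paper's method: in a tabular MDP, a fixed policy is evaluated pessimistically by assigning the worst possible return to state-action pairs that never appear in the offline data. The function name `worst_case_policy_value` and parameters such as `r_min` are hypothetical choices for this example.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): lower-bound the value of a
# fixed policy from offline tuples (s, a, r, s'), without assuming the data
# covers every state-action pair.
def worst_case_policy_value(dataset, policy, n_states, n_actions,
                            gamma=0.9, r_min=0.0, n_iters=200):
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        visits[s, a] += 1

    # Any unsupported (s, a) is assigned the worst-case discounted return.
    v_floor = r_min / (1.0 - gamma)
    v = np.full(n_states, v_floor)
    supported = visits > 0
    for _ in range(n_iters):
        q = np.full((n_states, n_actions), v_floor)
        p_hat = counts[supported] / visits[supported, None]        # empirical transitions
        r_hat = reward_sums[supported] / visits[supported]         # empirical mean rewards
        q[supported] = r_hat + gamma * p_hat @ v
        v = (policy * q).sum(axis=1)  # policy: (n_states, n_actions) action probabilities
    return v

# Toy usage: 2 states, 2 actions; the data covers only part of the space.
data = [(0, 0, 1.0, 1), (1, 0, 0.5, 0), (0, 0, 1.0, 1)]
pi = np.full((2, 2), 0.5)
print(worst_case_policy_value(data, pi, n_states=2, n_actions=2))
```

The resulting values are pessimistic lower bounds: wherever the policy places probability on actions outside the data support, the estimate falls back to the worst-case return, which mirrors the metric's behavior of remaining well defined under arbitrary data support.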
