NeurIPS 2020 : CoinDICE: Off-Policy Confidence Interval Estimation



CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvari, Dale Schuurmans

Spotlight presentation: Orals & Spotlights Track 04: Reinforcement Learning
on 2020-12-07T19:10:00-08:00 - 2020-12-07T19:20:00-08:00

Poster Session 1 (more posters)
on 2020-12-07T21:00:00-08:00 - 2020-12-07T23:00:00-08:00

Toggle Abstract Paper (in Proceedings / .pdf)

Abstract: We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the Q-function, we obtain an optimization problem with generalized estimating equation constraints. By applying the generalized empirical likelihood method to the resulting Lagrangian, we propose CoinDICE, a novel and efficient algorithm for computing confidence intervals. Theoretically, we prove the obtained confidence intervals are valid, in both asymptotic and finite-sample regimes. Empirically, we show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.

CoinDICE: Off-Policy Confidence Interval Estimation

Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvari, Dale Schuurmans

Preview Video and Chat

To see video, interact with the author and ask questions please use registration and login.