Nathan Kallus: Efficiently Breaking the Curse of Horizon with Double Reinforcement Learning
Fri Dec 13 02:00 PM -- 02:40 PM (PST)

Off-policy evaluation (OPE) is crucial for reinforcement learning in domains like medicine with limited exploration, but OPE is also notoriously difficult because the similarity between trajectories generated by any proposed policy and the observed data diminishes exponentially as horizons grow, a phenomenon known as the curse of horizon. To understand precisely when this curse bites, we consider for the first time the semi-parametric efficiency limits of OPE in Markov decision processes (MDPs), establishing the best-possible estimation errors and characterizing the curse as a problem-dependent phenomenon rather than a method-dependent one. Efficiency in OPE is crucial because, without exploration, we must use the available data to its fullest. In finite horizons, this shows that standard doubly-robust (DR) estimators are in fact inefficient for MDPs. In infinite horizons, while the curse renders certain problems fundamentally intractable, OPE may be feasible in ergodic time-invariant MDPs. We develop the first OPE estimator that achieves the efficiency limits in both settings, termed Double Reinforcement Learning (DRL). In both finite and infinite horizons, DRL improves upon existing estimators, which we show are inefficient, and leverages problem structure to its fullest in the face of the curse of horizon. We establish many favorable characteristics for DRL, including efficiency even when nuisances are estimated slowly by blackbox models, finite-sample guarantees, and model double robustness.
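For context, the standard finite-horizon doubly-robust OPE baseline that the abstract says DRL improves upon can be sketched as a backward per-step recursion. This is an illustrative sketch, not the DRL estimator itself; the function name, the toy trajectory interface, and the nuisance inputs `q_hat`/`v_hat` are assumptions made for the example.

```python
import numpy as np

def dr_ope(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Per-step doubly-robust OPE estimate for a finite-horizon MDP.

    trajectories: list of trajectories, each a list of (s, a, r) tuples
                  collected under the behavior policy pi_b.
    pi_e, pi_b:   callables giving action probabilities pi(a | s).
    q_hat, v_hat: estimated Q(s, a) and V(s) nuisances for the target policy
                  (in practice fit by regression; here passed in directly).
    """
    estimates = []
    for traj in trajectories:
        v_dr = 0.0
        # Backward recursion: V_dr(t) = V_hat(s) + rho * (r + gamma*V_dr(t+1) - Q_hat(s, a))
        for (s, a, r) in reversed(traj):
            rho = pi_e(a, s) / pi_b(a, s)  # per-step importance ratio
            v_dr = v_hat(s) + rho * (r + gamma * v_dr - q_hat(s, a))
        estimates.append(v_dr)
    return float(np.mean(estimates))
```

When the nuisances are set to zero and the policies coincide, the recursion collapses to the on-policy discounted return, which is a quick sanity check; the estimator is "doubly robust" in that it remains consistent if either the importance ratios or the Q-function estimates are correct.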

Author Information

Nathan Kallus (Cornell University)

More from the Same Authors