We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
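As a rough, illustrative reading of this principle (a sketch based only on the abstract above; the notation $Q$, $\pi$, $\gamma$, $\mu$, and $\mathcal{F}$ is assumed here rather than taken from the paper), enforcing the Bellman equations only along a test-function space $\mathcal{F}$ amounts to requiring

\[
\mathbb{E}_{(s,a,r,s') \sim \mu}\!\left[ f(s,a)\,\bigl( r + \gamma\, Q(s', \pi) - Q(s,a) \bigr) \right] = 0
\qquad \text{for all } f \in \mathcal{F},
\]

where $\mu$ denotes the offline data distribution and $Q(s',\pi)$ is shorthand for the value of $Q$ at $s'$ with the action drawn from $\pi$. Under this reading, the choice of $\mathcal{F}$ interpolates between weak and pointwise Bellman consistency: a richer test class enforces the equations more stringently, while a smaller class trades fidelity for statistical tractability, consistent with the abstract's statement that different choices of test function spaces address different problems within a common framework.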
Author Information
Andrea Zanette (University of California, Berkeley)
Martin J Wainwright (UC Berkeley)
More from the Same Authors
- 2013 Poster: Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima » Po-Ling Loh · Martin J Wainwright
- 2013 Poster: Information-theoretic lower bounds for distributed statistical estimation with communication constraints » Yuchen Zhang · John Duchi · Michael Jordan · Martin J Wainwright
- 2013 Spotlight: Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima » Po-Ling Loh · Martin J Wainwright
- 2013 Oral: Information-theoretic lower bounds for distributed statistical estimation with communication constraints » Yuchen Zhang · John Duchi · Michael Jordan · Martin J Wainwright
- 2013 Poster: Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation » John Duchi · Martin J Wainwright · Michael Jordan