Poster
Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff
Jian Qian · Haichen Hu · David Simchi-Levi
West Ballroom A-D #6404
Abstract:
Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression (Simchi-Levi and Xu, 2020), we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm that requires only a small number of calls to an offline density estimation algorithm (or oracle) across all $T$ rounds; this number can be further reduced if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
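For context, the kind of exploration-exploitation tradeoff such reductions build on can be illustrated by the inverse-gap-weighting rule of the cited contextual-bandit reduction (Simchi-Levi and Xu, 2020); the sketch below uses illustrative notation ($\hat{f}$ for the offline-regression estimate, $K$ actions, learning-rate parameter $\gamma$) and does not reproduce the layerwise version developed in this paper. Given a context $x$ and greedy action $\hat{a} = \arg\max_{a} \hat{f}(x,a)$, actions are sampled according to
$$
p(a \mid x) = \frac{1}{K + \gamma\bigl(\hat{f}(x,\hat{a}) - \hat{f}(x,a)\bigr)} \quad \text{for } a \neq \hat{a}, \qquad p(\hat{a} \mid x) = 1 - \sum_{a \neq \hat{a}} p(a \mid x),
$$
so actions with a larger estimated reward gap are explored less, and $\gamma$ controls how quickly the distribution concentrates on the greedy action across epochs.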