

Poster

Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

Jian Qian · Haichen Hu · David Simchi-Levi

West Ballroom A-D #6404
Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract: Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression \citep{simchi2020bypassing}, we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as a CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class $\mathcal{M}$ containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only $O(H \log T)$ calls to an offline density estimation algorithm (or oracle) across all $T$ rounds. This number can be further reduced to $O(H \log \log T)$ if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
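The stated oracle-call counts arise from invoking the offline density estimation oracle only on an epoch schedule (once per layer per epoch) rather than at every round. The sketch below is illustrative only: it assumes doubling epochs for the anytime $O(H \log T)$ count and epoch end times of roughly $T^{1-2^{-m}}$ for the known-horizon $O(H \log \log T)$ count, in the spirit of the contextual-bandit schedule of \citep{simchi2020bypassing}; the paper's exact schedule and oracle interface are not reproduced here.

```python
import math

def doubling_epochs(T):
    """Anytime-style schedule: epoch end times 2, 4, 8, ..., T.

    Calling the oracle once per layer at each epoch boundary gives
    O(H log T) oracle calls over T rounds.
    """
    ends, t = [], 2
    while t < T:
        ends.append(t)
        t *= 2
    ends.append(T)
    return ends

def loglog_epochs(T):
    """Known-horizon schedule (illustrative assumption): end times ~ T^{1 - 2^{-m}}.

    Using m = 1, ..., ceil(log2 log2 T) plus a final epoch ending at T gives
    O(H log log T) oracle calls when T is known in advance.
    """
    M = max(1, math.ceil(math.log2(math.log2(T))))
    ends = sorted({min(T, math.floor(T ** (1 - 2.0 ** (-m)))) for m in range(1, M + 1)})
    if ends[-1] < T:
        ends.append(T)
    return ends

# Compare the number of oracle calls implied by each schedule for H layers.
T, H = 100_000, 5
d, ll = doubling_epochs(T), loglog_epochs(T)
print(len(d), "epochs (doubling)  ->", H * len(d), "oracle calls for H =", H)
print(len(ll), "epochs (log-log)   ->", H * len(ll), "oracle calls for H =", H)
```

For $T = 10^5$, the doubling schedule yields roughly $\log_2 T \approx 17$ epochs per layer, while the known-horizon schedule yields about $\log_2 \log_2 T + 1 \approx 6$, matching the $O(H \log T)$ versus $O(H \log \log T)$ gap described in the abstract.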
