

Poster

How Does Variance Shape the Regret in Contextual Bandits?

Zeyu Jia · Jian Qian · Alexander Rakhlin · Chen-Yu Wei

West Ballroom A-D #6705
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract: We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax regret bounds, we show that the eluder dimension d_elu (a measure of the complexity of the function class) plays a crucial role in variance-dependent bounds. We consider two types of adversary:

(1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of Ω(√(min(A, d_elu)·Λ) + d_elu) is unavoidable when d_elu ≤ √(AT), where A is the number of actions, T is the total number of rounds, and Λ is the total variance over T rounds. For the A ≤ d_elu regime, we derive a nearly matching upper bound Õ(√(AΛ) + d_elu) for the special case where the variance is revealed at the beginning of each round.

(2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of Ω(√(d_elu·Λ) + d_elu) is unavoidable when √(d_elu·Λ) + d_elu ≤ √(AT). In this setting, we provide an upper bound of order Õ(d_elu·√Λ + d_elu).

Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound Õ(d_elu·√Λ + d_elu) established in their work is unimprovable when √(d_elu·Λ) + d_elu ≤ √(AT). However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of Õ(√(AΛ) + d_elu).
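To see why small total variance Λ gives better-than-minimax regret, the variance-dependent rates above can be compared numerically against a √(AT)-type minimax rate. The sketch below is purely illustrative (not from the paper): the values of A, T, Λ, and d_elu are hypothetical, and all constants and logarithmic factors are suppressed.

```python
import math

# Hypothetical problem parameters (constants and log factors suppressed).
A = 10          # number of actions
T = 100_000     # number of rounds
d_elu = 50      # eluder dimension of the function class (assumed value)

def weak_adversary_bound(total_variance):
    """Variance-dependent rate sqrt(A * Lambda) + d_elu (weak adversary, A <= d_elu regime)."""
    return math.sqrt(A * total_variance) + d_elu

def strong_adversary_bound(total_variance):
    """Variance-dependent rate d_elu * sqrt(Lambda) + d_elu (strong adversary)."""
    return d_elu * math.sqrt(total_variance) + d_elu

minimax_rate = math.sqrt(A * T)  # sqrt(AT)-type worst-case rate

# When Lambda << T, both variance-dependent rates can fall well below sqrt(AT);
# when Lambda is as large as T, the weak-adversary rate recovers sqrt(AT).
for total_variance in (100, 1_000, T):
    print(f"Lambda={total_variance:>7}: "
          f"weak={weak_adversary_bound(total_variance):9.1f}  "
          f"strong={strong_adversary_bound(total_variance):9.1f}  "
          f"minimax={minimax_rate:9.1f}")
```

For Λ = 100, for example, the weak-adversary rate √(10·100) + 50 ≈ 81.6 is far below √(AT) = 1000, illustrating how low-variance instances admit smaller regret.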
