

Poster

How Does Variance Shape the Regret in Contextual Bandits?

Zeyu Jia · Jian Qian · Alexander Rakhlin · Chen-Yu Wei

West Ballroom A-D #6705
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract: We consider realizable contextual bandits with general function approximation, investigating how small reward variance can lead to better-than-minimax regret bounds. Unlike in minimax regret bounds, we show that the eluder dimension d_elu (a measure of the complexity of the function class) plays a crucial role in variance-dependent bounds. We consider two types of adversary:

(1) Weak adversary: The adversary sets the reward variance before observing the learner's action. In this setting, we prove that a regret of Ω(√(min(A, d_elu)·Λ) + d_elu) is unavoidable when d_elu ≤ √(AT), where A is the number of actions, T is the total number of rounds, and Λ is the total variance over T rounds. For the A ≤ d_elu regime, we derive a nearly matching upper bound Õ(√(AΛ) + d_elu) for the special case where the variance is revealed at the beginning of each round.

(2) Strong adversary: The adversary sets the reward variance after observing the learner's action. We show that a regret of Ω(√(d_elu·Λ) + d_elu) is unavoidable when √(d_elu·Λ) + d_elu ≤ √(AT). In this setting, we provide an upper bound of order Õ(d_elu·√Λ + d_elu).

Furthermore, we examine the setting where the function class additionally provides distributional information of the reward, as studied by Wang et al. (2024). We demonstrate that the regret bound Õ(d_elu·√Λ + d_elu) established in their work is unimprovable when √(d_elu·Λ) + d_elu ≤ √(AT). However, with a slightly different definition of the total variance and with the assumption that the reward follows a Gaussian distribution, one can achieve a regret of Õ(√(AΛ) + d_elu).
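To see why small total variance Λ gives better-than-minimax regret, the variance-dependent rates above can be compared numerically against a √(AT)-type minimax rate. The sketch below is purely illustrative (not from the paper): the values of A, T, Λ, and d_elu are hypothetical, and all constants and logarithmic factors are suppressed.

```python
import math

# Hypothetical problem parameters (constants and log factors suppressed).
A = 10          # number of actions
T = 100_000     # number of rounds
d_elu = 50      # eluder dimension of the function class (assumed value)

def weak_adversary_bound(total_variance):
    """Variance-dependent rate sqrt(A * Lambda) + d_elu (weak adversary, A <= d_elu regime)."""
    return math.sqrt(A * total_variance) + d_elu

def strong_adversary_bound(total_variance):
    """Variance-dependent rate d_elu * sqrt(Lambda) + d_elu (strong adversary)."""
    return d_elu * math.sqrt(total_variance) + d_elu

minimax_rate = math.sqrt(A * T)  # sqrt(AT)-type worst-case rate

# When Lambda << T, both variance-dependent rates can fall well below sqrt(AT);
# when Lambda is as large as T, the weak-adversary rate recovers sqrt(AT).
for total_variance in (100, 1_000, T):
    print(f"Lambda={total_variance:>7}: "
          f"weak={weak_adversary_bound(total_variance):9.1f}  "
          f"strong={strong_adversary_bound(total_variance):9.1f}  "
          f"minimax={minimax_rate:9.1f}")
```

For Λ = 100, for example, the weak-adversary rate √(10·100) + 50 ≈ 81.6 is far below √(AT) = 1000, illustrating how low-variance instances admit smaller regret.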
