Timezone: »

Optimal Algorithms for Stochastic Contextual Preference Bandits
Aadirupa Saha

Tue Dec 07 08:30 AM -- 10:00 AM (PST) @
We consider the problem of preference bandits in the contextual setting. At each round, the learner is presented with a context set of $K$ items, chosen randomly from a potentially infinite set of arms $\mathcal D \subseteq \mathbf R^d$. However, unlike classical contextual bandits, our framework only allows the learner to receive feedback in terms of item preferences: At each round, the learner is allowed to play a subset of size $q$ (any $q \in \{2,\ldots,K\}$) upon which only a (noisy) winner of the subset is revealed. Yet, same as the classical setup, the goal is still to compete against the best context arm at each round. The problem is relevant in various online decision-making scenarios, including recommender systems, information retrieval, tournament ranking--typically any application where it's easier to elicit the items' relative strength instead of their absolute scores. To the best of our knowledge, this work is the first to consider preference-based stochastic contextual bandits for potentially infinite decision spaces. We start with presenting two algorithms for the special case of pairwise preferences $(q=2)$: The first algorithm is simple and easy to implement with an $\tilde O(d\sqrt{T})$ regret guarantee, while the second algorithm is shown to achieve the optimal $\tilde O(\sqrt{dT})$ regret, as follows from our $\Omega(\sqrt {dT})$ matching lower bound analysis. We then proceed to analyze the problem for any general $q$-subsetwise preferences ($q \ge 2$), where surprisingly, our lower bound proves the fundamental performance limit to be $\Omega(\sqrt{d T})$ yet again, independent of the subsetsize $q$. Following this, we propose a matching upper bound algorithm justifying the tightness of our results. This implies having access to subsetwise preferences does not help in faster information aggregation for our feedback model. All the results are corroborated empirically against existing baselines.

Author Information

Aadirupa Saha (Microsoft Research)

Aadirupa Saha is a PhD student at the department of Computer Science and Automation (CSA), Indian Institute of Science (IISc), Bangalore and was a research intern at Google, Mountain View, CA (June-Sept, 2019). Her research interests broadly lie in the areas of Machine Learning, Statistical Learning Theory and Optimization. Her current research specifically focuses on decision making under uncertainty from sequential data, reinforcement learning, and preference based rank aggregation problems.

More from the Same Authors