### Session

## Orals & Spotlights Track 24: Learning Theory

### Avrim Blum · Steve Hanneke

**Improved guarantees and a multiple-descent curve for Column Subset Selection and the Nystrom method**

Michal Derezinski · Rajiv Khanna · Michael Mahoney

The Column Subset Selection Problem (CSSP) and the Nystrom method
are among the leading tools for constructing small low-rank
approximations of large datasets in machine learning and scientific
computing. A fundamental question in this area is: how well can a data subset of
size k compete with the best rank k approximation?
We develop techniques which exploit spectral properties of the data
matrix to obtain improved approximation guarantees which go beyond the
standard worst-case analysis.
Our approach leads to significantly better bounds for datasets with
known rates of singular value decay, e.g., polynomial or exponential decay.

Our analysis also reveals an intriguing phenomenon: the approximation
factor as a function of k may exhibit multiple peaks and valleys,
which we call a multiple-descent curve.

A lower bound we establish shows that this behavior is not an artifact
of our analysis, but rather it is an inherent property of the CSSP and
Nystrom tasks. Finally, using the example of a radial basis function (RBF)
kernel, we show that both our improved bounds and the multiple-descent
curve can be observed on real datasets simply by varying the RBF parameter.

**Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs**

Chung-Wei Lee · Haipeng Luo · Chen-Yu Wei · Mengxiao Zhang

We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality.

Besides its simplicity, our approach enjoys several advantages. First, the obtained high-probability regret bounds are data-dependent and could be much smaller than the worst-case bounds, which resolves an open problem asked by Neu (2015). Second, resolving another open problem of Bartlett et al. (2008) and Abernethy and Rakhlin (2009), our approach leads to the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, while previous methods are either inefficient or only applicable to specific action sets. Finally, our approach can also be applied to learning adversarial Markov Decision Processes and provides the first algorithm with a high-probability small-loss bound for this problem.

**Worst-Case Analysis for Randomly Collected Data**

Justin Chen · Gregory Valiant · Paul Valiant

We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements [n]={1,...,n} with corresponding data values x*1,...,x*n. We observe the values for a "sample" set A \subset [n] and wish to estimate some statistic of the values for a "target" set B \subset [n] where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known distribution P over pairs of subsets of [n]. A given estimation algorithm is evaluated based on its "worst-case, expected error" where the expectation is with respect to the distribution P from which the sample A and target sets B are drawn, and the worst-case is with respect to the data values x*1,...,x*n. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample valuesâ€“-where the weights are functions of the distribution P and the sample and target sets A, B--and show that the worst-case expected error achieved by this algorithm is at most a multiplicative pi/2 factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We also extend these results to the linear regression setting where each datapoint is not a scalar but a labeled vector (x*i,y*i). This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework and introduces a uniform analysis for the many natural settings where membership in a sample may be correlated with data values, such as when individuals are recruited into a sample through their social networks as in "snowball/chain" sampling or when samples have chronological structure as in "selective prediction".

**On Adaptive Distance Estimation**

Yeshwanth Cherapanamjeri · Jelani Nelson

We provide a static data structure for distance estimation which supports {\it adaptive} queries. Concretely, given a dataset $X = \{x_i\}_{i = 1}^n$ of $n$ points in $\mathbb{R}^d$ and $0 < p \leq 2$, we construct a randomized data structure with low memory consumption and query time which, when later given any query point $q \in \mathbb{R}^d$, outputs a $(1+\varepsilon)$-approximation of $\|q - x_i\|_p$ with high probability for all $i\in[n]$. The main novelty is our data structure's correctness guarantee holds even when the sequence of queries can be chosen adaptively: an adversary is allowed to choose the $j$th query point $q_j$ in a way that depends on the answers reported by the data structure for $q_1,\ldots,q_{j-1}$. Previous randomized Monte Carlo methods do not provide error guarantees in the setting of adaptively chosen queries. Our memory consumption is $\tilde O(nd/\varepsilon^2)$, slightly more than the $O(nd)$ required to store $X$ in memory explicitly, but with the benefit that our time to answer queries is only $\tilde O(\varepsilon^{-2}(n + d))$, much faster than the naive $\Theta(nd)$ time obtained from a linear scan in the case of $n$ and $d$ very large. Here $\tilde O$ hides $\log(nd/\varepsilon)$ factors. We discuss applications to nearest neighbor search and nonparametric estimation. Our method is simple and likely to applicable to other domains: we describe a generic approach for transforming randomized Monte Carlo data structures which do not support adaptive queries to ones that do, and show that for the problem at hand it can be applied to standard nonadaptive solutions to $\ell_p$ norm estimation with negligible overhead in query time and a factor $d$ overhead in memory.

**Tight First- and Second-Order Regret Bounds for Adversarial Linear Bandits**

Shinji Ito · Shuichi Hirahara · Tasuku Soma · Yuichi Yoshida

We propose novel algorithms with first- and second-order regret bounds for adversarial linear bandits. These regret bounds imply that our algorithms perform well when there is an action achieving a small cumulative loss or the loss has a small variance. In addition, we need only assumptions weaker than those of existing algorithms; our algorithms work on discrete action sets as well as continuous ones without a priori knowledge about losses, and they run efficiently if a linear optimization oracle for the action set is available. These results are obtained by combining optimistic online optimization, continuous multiplicative weight update methods, and a novel technique that we refer to as distribution truncation. We also show that the regret bounds of our algorithms are tight up to polylogarithmic factors.

**Delay and Cooperation in Nonstochastic Linear Bandits**

Shinji Ito · Daisuke Hatano · Hanna Sumita · Kei Takemura · Takuro Fukunaga · Naonori Kakimura · Ken-Ichi Kawarabayashi

This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action, and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed $d$ rounds after the choice of an action, and propose an algorithm for which the expected regret is $\tilde{O}( \sqrt{m (m + d) T} )$, ignoring logarithmic factors in $m$ and $T$, where $m$ and $T$ denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer the regret of $\Omega(\sqrt{m (m+d) T})$ in the worst case. To develop the algorithm, we introduce a technique we refer to as \textit{distribution truncation}, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [17] and Bar-On and Mansour [12], and extend their results to the linear bandits setting.

**Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms**

Mohsen Bayati · Nima Hamidi · Ramesh Johari · Khashayar Khosravi

We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$. We first show that {\em subsampling} is a critical step for designing optimal policies. In particular, the standard UCB algorithm leads to sub-optimal regret bounds in the many-armed regime. However, a subsampled UCB (SS-UCB), which samples $\Theta(\sqrt{T})$ arms and executes UCB only on that subset, is rate-optimal. Despite theoretically optimal regret, even SS-UCB performs poorly due to excessive exploration of suboptimal arms. In particular, in numerical experiments SS-UCB performs worse than a simple greedy algorithm (and its subsampled version) that pulls the current empirical best arm at every time period. We show that these insights hold even in a contextual setting, using real-world data. These empirical results suggest a novel form of {\em free exploration} in the many-armed regime that benefits greedy algorithms. We theoretically study this new source of free exploration and find that it is deeply connected to the distribution of a certain tail event for the prior distribution of arm rewards. This is a fundamentally distinct phenomenon from free exploration as discussed in the recent literature on contextual bandits, where free exploration arises due to variation in contexts. We use this insight to prove that the subsampled greedy algorithm is rate-optimal for Bernoulli bandits when $k > \sqrt{T}$, and achieves sublinear regret with more general distributions. This is a case where theoretical rate optimality does not tell the whole story: when complemented by the empirical observations of our paper, the power of greedy algorithms becomes quite evident. Taken together, from a practical standpoint, our results suggest that in applications it may be preferable to use a variant of the greedy algorithm in the many-armed regime.

**Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition**

Tiancheng Jin · Haipeng Luo

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a ``best-of-both-worlds'' guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with \tilde{O}(\sqrt{T}) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves \tilde{O}(\sqrt{C}) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu (2013), with a novel hybrid regularizer inspired by recent works of Zimmert et al. (2019a, 2019b) for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.

**A Tight Lower Bound and Efficient Reduction for Swap Regret**

Shinji Ito

Swap regret, a generic performance measure of online decision-making algorithms, plays an important role in the theory of repeated games, along with a close connection to correlated equilibria in strategic games. This paper shows an $\Omega( \sqrt{T N\log{N}} )$-lower bound for swap regret, where $T$ and $N$ denote the numbers of time steps and available actions, respectively. Our lower bound is tight up to a constant, and resolves an open problem mentioned, e.g., in the book by Nisan et al. (2007). Besides, we present a computationally efficient reduction method that converts no-external-regret algorithms to no-swap-regret algorithms. This method can be applied not only to the full-information setting but also to the bandit setting and provides a better regret bound than previous results.

**Estimation of Skill Distribution from a Tournament**

Ali Jadbabaie · Anuran Makur · Devavrat Shah

In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of outcomes of games are governed by the parametric Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown, non-parametric skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a surprisingly simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, so long as the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove information theoretic lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a careful covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which in turn provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.

**Optimal Prediction of the Number of Unseen Species with Multiplicity**

Yi Hao · Ping Li

Based on a sample of size $n$, we consider estimating the number of symbols that appear at least $\mu$ times in an independent sample of size $a \cdot n$, where $a$ is a given parameter. This formulation includes, as a special case, the well-known problem of inferring the number of unseen species introduced by [Fisher et al.] in 1943 and considered by many others. Of considerable interest in this line of works is the largest $a$ for which the quantity can be accurately predicted. We completely resolve this problem by determining the limit of estimation to be $a \approx (\log n)/\mu$, with both lower and upper bounds matching up to constant factors. For the particular case of $\mu = 1$, this implies the recent result by [Orlitsky et al.] on the unseen species problem. Experimental evaluations show that the proposed estimator performs exceptionally well in practice. Furthermore, the estimator is a simple linear combination of symbols' empirical counts, and hence linear-time computable.

**Estimating Rank-One Spikes from Heavy-Tailed Noise via Self-Avoiding Walks**

Jingqiu Ding · Samuel Hopkins · David Steurer

We study symmetric spiked matrix models with respect to a general class of noise distributions. Given a rank-1 deformation of a random noise matrix, whose entries are independently distributed with zero mean and unit variance, the goal is to estimate the rank-1 part. For the case of Gaussian noise, the top eigenvector of the given matrix is a widely-studied estimator known to achieve optimal statistical guarantees, e.g., in the sense of the celebrated BBP phase transition. However, this estimator can fail completely for heavy-tailed noise.

In this work, we exhibit an estimator that works for heavy-tailed noise up to the BBP threshold that is optimal even for Gaussian noise. We give a non-asymptotic analysis of our estimator which relies only on the variance of each entry remaining constant as the size of the matrix grows: higher moments may grow arbitrarily fast or even fail to exist. Previously, it was only known how to achieve these guarantees if higher-order moments of the noises are bounded by a constant independent of the size of the matrix.

Our estimator can be evaluated in polynomial time by counting self-avoiding walks via a color coding technique. Moreover, we extend our estimator to spiked tensor models and establish analogous results.