Session
Algorithms
A Linear-Time Kernel Goodness-of-Fit Test
Wittawat Jitkrittum · Wenkai Xu · Zoltan Szabo · Kenji Fukumizu · Arthur Gretton
We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.
Generalization Properties of Learning with Random Features
Alessandro Rudi · Lorenzo Rosasco
We study the generalization properties of ridge regression with random features in the statistical learning framework. We show for the first time that $O(1/\sqrt{n})$ learning bounds can be achieved with only $O(\sqrt{n}\log n)$ random features rather than $O({n})$ as suggested by previous results. Further, we prove faster learning rates and show that they might require more random features, unless they are sampled according to a possibly problem dependent distribution. Our results shed light on the statistical computational trade-offs in large scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while keeping optimal generalization properties.
Communication-Efficient Distributed Learning of Discrete Distributions
Ilias Diakonikolas · Elena Grigorescu · Jerry Li · Abhiram Natarajan · Krzysztof Onak · Ludwig Schmidt
We initiate a systematic study of distribution learning (or density estimation) in the distributed model. In this problem the data drawn from an unknown distribution is partitioned across multiple machines. The machines must succinctly communicate with a referee so that in the end the referee can estimate the underlying distribution of the data. The problem is motivated by the pressing need to build communication-efficient protocols in various distributed systems, where power consumption or limited bandwidth impose stringent communication constraints. We give the first upper and lower bounds on the communication complexity of nonparametric density estimation of discrete probability distributions under both l1 and the l2 distances. Specifically, our results include the following: 1. In the case when the unknown distribution is arbitrary and each machine has only one sample, we show that any interactive protocol that learns the distribution must essentially communicate the entire sample. 2. In the case of structured distributions, such as $k$-histograms and monotone, we design distributed protocols that achieve better communication guarantees than the trivial ones, and show tight bounds in some regimes.
Posterior sampling for reinforcement learning: worst-case regret bounds
Shipra Agrawal · Randy Jia
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions and diameter $D$, when $T\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy, in time horizon $T$. This result improves over the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem.
The dueling bandit is a learning framework where the feedback information in the learning process is restricted to noisy comparison between a pair of actions. In this paper, we address a dueling bandit problem based on a cost function over a continuous space. We propose a stochastic mirror descent algorithm and show that the algorithm achieves an $O(¥sqrt{T¥log T})$-regret bound under strong convexity and smoothness assumptions for the cost function. Then, we clarify the equivalence between regret minimization in dueling bandit and convex optimization for the cost function. Moreover, considering a lower bound in convex optimization, it is turned out that our algorithm achieves the optimal convergence rate in convex optimization and the optimal regret in dueling bandit except for a logarithmic factor.
Minimal Exploration in Structured Stochastic Bandits
Richard Combes · Stefan Magureanu · Alexandre Proutiere
This paper introduces and addresses a wide class of stochastic bandit problems where the function mapping the arm to the corresponding reward exhibits some known structural properties. Most existing structures (e.g. linear, lipschitz, unimodal, combinatorial, dueling,...) are covered by our framework. We derive an asymptotic instance-specific regret lower bound for these problems, and develop OSSB, an algorithm whose regret matches this fundamental limit. OSSB is not based on the classical principle of ``optimism in the face of uncertainty'' or on Thompson sampling, and rather aims at matching the minimal exploration rates of sub-optimal arms as characterized in the derivation of the regret lower bound. We illustrate the efficiency of OSSB using numerical experiments in the case of the linear bandit problem and show that OSSB outperforms existing algorithms, including Thompson sampling
Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe
Quentin Berthet · Vianney Perchet
We consider the problem of bandit optimization, inspired by stochastic optimization and online learning problems with bandit feedback. In this problem, the objective is to minimize a global loss function of all the actions, not necessarily a cumulative loss. This framework allows us to study a very general class of problems, with applications in statistics, machine learning, and other fields. To solve this problem, we analyze the Upper-Confidence Frank-Wolfe algorithm, inspired by techniques for bandits and convex optimization. We give theoretical guarantees for the performance of this algorithm over various classes of functions, and discuss the optimality of these results.
Diving into the shallows: a computational perspective on large-scale shallow learning
SIYUAN MA · Mikhail Belkin
Remarkable recent success of deep neural networks has not been easy to analyze theoretically. It has been particularly hard to disentangle relative significance of architecture and optimization in achieving accurate classification on large datasets. On the flip side, shallow methods (e.g. kernel methods) have encountered obstacles in scaling to large data. Practical methods, such as variants of gradient descent used so successfully in deep learning, seem to perform below par when applied to kernel methods. This difficulty has sometimes been attributed to the limitations of shallow architecture. In this paper we first identify a basic limitation in gradient descent-based optimization methods when used in conjunctions with smooth kernels. An analysis demonstrates that only a vanishingly small fraction of the function space is reachable after a polynomial number of gradient descent iterations. That drastically limits approximating power of gradient descent for a fixed computational budget and leading to serious over-regularization. The issue is purely algorithmic, persisting even in the limit of infinite data. To address this shortcoming in practice, we introduce EigenPro iteration, based on a simple and direct preconditioning scheme using a small number of approximate eigenvectors. It can also be viewed as learning a new kernel optimized for gradient descent. It turns out that injecting this small amount of approximate second-order information leads to major improvements in convergence. For large data, this translates into significant performance boost over the state-of-the-art for kernel methods. In particular, we are able to match or improve the results recently reported in the literature at a small fraction of their computational budget. Finally, we feel that these results show a need for a broader computational perspective on modern large-scale learning to complement more traditional statistical and convergence analyses.
Monte-Carlo Tree Search by Best Arm Identification
Emilie Kaufmann · Wouter Koolen
Recent advances in bandit tools and techniques for sequential learning are steadily enabling new applications and are promising the resolution of a range of challenging related problems. We study the game tree search problem, where the goal is to quickly identify the optimal move in a given game tree by sequentially sampling its stochastic payoffs. We develop new algorithms for trees of arbitrary depth, that operate by summarizing all deeper levels of the tree into confidence intervals at depth one, and applying a best arm identification procedure at the root. We prove new sample complexity guarantees with a refined dependence on the problem instance. We show experimentally that our algorithms outperform existing elimination-based algorithms and match previous special-purpose methods for depth-two trees.
A framework for Multi-A(rmed)/B(andit) Testing with Online FDR Control
Fanny Yang · Aaditya Ramdas · Kevin Jamieson · Martin Wainwright
We propose an alternative framework to existing setups for controlling false alarms when multiple A/B tests are run over time. This setup arises in many practical applications, e.g. when pharmaceutical companies test new treatment options against control pills for different diseases, or when internet companies test their default webpages versus various alternatives over time. Our framework proposes to replace a sequence of A/B tests by a sequence of best-arm MAB instances, which can be continuously monitored by the data scientist. When interleaving the MAB tests with an an online false discovery rate (FDR) algorithm, we can obtain the best of both worlds: low sample complexity and any time online FDR control. Our main contributions are: (i) to propose reasonable definitions of a null hypothesis for MAB instances; (ii) to demonstrate how one can derive an always-valid sequential p-value that allows continuous monitoring of each MAB test; and (iii) to show that using rejection thresholds of online-FDR algorithms as the confidence levels for the MAB algorithms results in both sample-optimality, high power and low FDR at any point in time. We run extensive simulations to verify our claims, and also report results on real data collected from the New Yorker Cartoon Caption contest.
Parameter-Free Online Learning via Model Selection
Dylan J Foster · Satyen Kale · Mehryar Mohri · Karthik Sridharan
We introduce a new framework for deriving efficient algorithms that obtain model selection oracle inequalities in the adversarial online learning setting, also sometimes described as parameter-free online learning. While work in this area has focused on specific, highly-structured, function classes, such as nested balls in a Hilbert space, we eschew this approach and propose a generic meta-algorithm framework which achieves oracle inequalities under minimal structural assumptions. This allows us to derive new computationally efficient algorithms with oracle bounds for a wide range of settings where such results were previously unavailable. We give the first computationally efficient algorithms which work in arbitrary Banach spaces under mild smoothness assumptions --- previous results only applied to the Hilbert case. We further derive new oracle inequalities for various matrix classes, non-nested convex sets, and $\R^{d}$ with generic regularizers. Finally, we generalize further by providing oracle inequalities for arbitrary non-linear classes in the contextual learning model; in particular, we give new algorithms for learning with multiple kernels. These results are all derived through a unified meta-algorithm scheme based on a novel "multi-scale" algorithm for prediction with expert advice based on random playout, which may be of independent interest.
Bregman Divergence for Stochastic Variance Reduction: Saddle-Point and Adversarial Prediction
Zhan Shi · Xinhua Zhang · Yaoliang Yu
Adversarial machines, where a learner competes against an adversary, have re-gained much recent interest in machine learning. They are naturally in the form of saddle-point optimization, often with separable structure but sometimes also with unmanageably large dimension. In this work we show that adversarial prediction under multivariate losses can be solved much faster than they used to be. We first reduce the problem size exponentially by using appropriate sufficient statistics, and then we adapt the new stochastic variance-reduced algorithm of Balamurugan & Bach (2016) to allow any Bregman divergence. We prove that the same linear rate of convergence is retained and we show that for adversarial prediction using KL-divergence we can further achieve a speedup of #example times compared with the Euclidean alternative. We verify the theoretical findings through extensive experiments on two example applications: adversarial prediction and LPboosting.
Gaussian Quadrature for Kernel Features
Tri Dao · Christopher M De Sa · Christopher Ré
Kernel methods have recently attracted resurgent interest, matching the performance of deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\epsilon$. In this paper, we investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve comparable accuracy to the state-of-the-art kernel methods based on random Fourier features.
Online Learning of Linear Dynamical Systems
Elad Hazan · Karan Singh · Cyril Zhang
We present an efficient and practical algorithm for the online prediction of discrete-time linear dynamical systems. Despite the non-convex optimization problem, using improper learning and convex relaxation our algorithm comes with provable guarantees: it has near-optimal regret bounds compared to the best LDS in hindsight, while overparameterizing by only a small logarithmic factor. Our analysis brings together ideas from improper learning by convex relaxations, online regret minimization, and the spectral theory of Hankel matrices.