Track: Wed Track 3 -- Session 1

Wed 5 Dec. 6:45 - 6:50 PST

Spotlight

Revisiting $(\epsilon, \gamma, \tau)$-similarity learning for domain adaptation

Sofiane Dhouib ⋅ Ievgen Redko

Similarity learning is an active research area in machine learning that tackles the problem of finding a similarity function tailored to an observable data sample in order to achieve efficient classification. This learning scenario has been generally formalized by the means of a $(\epsilon, \gamma, \tau)-$good similarity learning framework in the context of supervised classification and has been shown to have strong theoretical guarantees. In this paper, we propose to extend the theoretical analysis of similarity learning to the domain adaptation setting, a particular situation occurring when the similarity is learned and then deployed on samples following different probability distributions. We give a new definition of an $(\epsilon, \gamma)-$good similarity for domain adaptation and prove several results quantifying the performance of a similarity function on a target domain after it has been trained on a source domain. We particularly show that if the source distribution dominates the target one, then principally new domain adaptation learning bounds can be proved.

Wed 5 Dec. 6:50 - 6:55 PST

Spotlight

Leveraged volume sampling for linear regression

Michal Derezinski ⋅ Manfred K. Warmuth ⋅ Daniel Hsu

Suppose an n x d design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number k << n of the responses, and then produce a weight vector whose sum of squares loss over all points is at most 1+epsilon times the minimum. When k is very small (e.g., k=d), jointly sampling diverse subsets of points is crucial. One such method called "volume sampling" has a unique and desirable property that the weight vector it produces is an unbiased estimate of the optimum. It is therefore natural to ask if this method offers the optimal unbiased estimate in terms of the number of responses k needed to achieve a 1+epsilon loss approximation.

Surprisingly we show that volume sampling can have poor behavior when we require a very accurate approximation -- indeed worse than some i.i.d. sampling techniques whose estimates are biased, such as leverage score sampling. We then develop a new rescaled variant of volume sampling that produces an unbiased estimate which avoids this bad behavior and has at least as good a tail bound as leverage score sampling: sample size k=O(d log d + d/epsilon) suffices to guarantee total loss at most 1+epsilon times the minimum with high probability. Thus, we improve on the best previously known sample size for an unbiased estimator, k=O(d^2/epsilon).

Our rescaling procedure leads to a new efficient algorithm for volume sampling which is based on a "determinantal rejection sampling" technique with potentially broader applications to determinantal point processes. Other contributions include introducing the combinatorics needed for rescaled volume sampling and developing tail bounds for sums of dependent random matrices which arise in the process.

Wed 5 Dec. 6:55 - 7:00 PST

Spotlight

Synthesize Policies for Transfer and Adaptation across Tasks and Environments

Hexiang Hu ⋅ Liyu Chen ⋅ Boqing Gong ⋅ Fei Sha

The ability to transfer in reinforcement learning is key towards building an agent of general artificial intelligence. In this paper, we consider the problem of learning to simultaneously transfer across both environments and tasks, probably more importantly, by learning from only sparse (environment, task) pairs out of all the possible combinations. We propose a novel compositional neural network architecture which depicts a meta rule for composing policies from environment and task embeddings. Notably, one of the main challenges is to learn the embeddings jointly with the meta rule. We further propose new training methods to disentangle the embeddings, making them both distinctive signatures of the environments and tasks and effective building blocks for composing the policies. Experiments on GridWorld and THOR, of which the agent takes as input an egocentric view, show that our approach gives rise to high success rates on all the (environment, task) pairs after learning from only 40% of them.

Wed 5 Dec. 7:00 - 7:05 PST

Spotlight

Sublinear Time Low-Rank Approximation of Distance Matrices

Ainesh Bakshi ⋅ David Woodruff

Let $\PP=\{ p_1, p_2, \ldots p_n \}$ and $\QQ = \{ q_1, q_2 \ldots q_m \}$ be two point sets in an arbitrary metric space. Let $\AA$ represent the $m\times n$ pairwise distance matrix with $\AA_{i,j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $\AA$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $\PP = \QQ$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time. Finally, we empirically compare our algorithm with the SVD and input sparsity time algorithms. Our algorithm is several hundred times faster than the SVD, and about $8$-$20$ times faster than input sparsity methods on real-world and and synthetic datasets of size $10^8$. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.

Wed 5 Dec. 7:05 - 7:20 PST

Oral

Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes

Hassan Ashtiani ⋅ Shai Ben-David ⋅ Nicholas Harvey ⋅ Christopher Liaw ⋅ Abbas Mehrabian ⋅ Yaniv Plan

We prove that ϴ(k d^2 / ε^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error ε in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that O(k d / ε^2) samples suffice, matching a known lower bound.

The upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d has an efficient sample compression.

Wed 5 Dec. 7:20 - 7:25 PST

Spotlight

Minimax Statistical Learning with Wasserstein distances

Jaeho Lee ⋅ Maxim Raginsky

As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove generalization bounds that involve the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.

Wed 5 Dec. 7:25 - 7:30 PST

Spotlight

Generalization Bounds for Uniformly Stable Algorithms

Vitaly Feldman ⋅ Jan Vondrak

Uniform stability of a learning algorithm is a classical notion of algorithmic stability introduced to derive high-probability bounds on the generalization error (Bousquet and Elisseeff, 2002). Specifically, for a loss function with range bounded in $[0,1]$, the generalization error of $\gamma$-uniformly stable learning algorithm on $n$ samples is known to be at most $O((\gamma +1/n) \sqrt{n \log(1/\delta)})$ with probability at least $1-\delta$. Unfortunately, this bound does not lead to meaningful generalization bounds in many common settings where $\gamma \geq 1/\sqrt{n}$. At the same time the bound is known to be tight only when $\gamma = O(1/n)$. Here we prove substantially stronger generalization bounds for uniformly stable algorithms without any additional assumptions. First, we show that the generalization error in this setting is at most $O(\sqrt{(\gamma + 1/n) \log(1/\delta)})$ with probability at least $1-\delta$. In addition, we prove a tight bound of $O(\gamma^2 + 1/n)$ on the second moment of the generalization error. The best previous bound on the second moment of the generalization error is $O(\gamma + 1/n)$. Our proofs are based on new analysis techniques and our results imply substantially stronger generalization guarantees for several well-studied algorithms.

Wed 5 Dec. 7:30 - 7:35 PST

Spotlight

A loss framework for calibrated anomaly detection

Aditya Menon ⋅ Robert Williamson

Given samples from a probability distribution, anomaly detection is the problem of determining if a given point lies in a low-density region. This paper concerns calibrated anomaly detection, which is the practically relevant extension where we additionally wish to produce a confidence score for a point being anomalous. Building on a classification framework for anomaly detection, we show how minimisation of a suitably modified proper loss produces density estimates only for anomalous instances. We then show how to incorporate quantile control by relating our objective to a generalised version of the pinball loss. Finally, we show how to efficiently optimise the objective with kernelised scorer, by leveraging a recent result from the point process literature. The resulting objective captures a close relative of the one-class SVM as a special case.

Wed 5 Dec. 7:35 - 7:40 PST

Spotlight

Sharp Bounds for Generalized Uniformity Testing

Ilias Diakonikolas ⋅ Daniel M. Kane ⋅ Alistair Stewart

We study the problem of generalized uniformity testing of a discrete probability distribution: Given samples from a probability distribution p over an unknown size discrete domain Ω, we want to distinguish, with probability at least 2/3, between the case that p is uniform on some subset of Ω versus ε-far, in total variation distance, from any such uniform distribution. We establish tight bounds on the sample complexity of generalized uniformity testing. In more detail, we present a computationally efficient tester whose sample complexity is optimal, within constant factors, and a matching worst-case information-theoretic lower bound. Specifically, we show that the sample complexity of generalized uniformity testing is Θ(1/(ε^(4/3) ||p||_3) + 1/(ε^2 ||p||_2 )).

Wed 5 Dec. 7:40 - 7:45 PST

Spotlight

Convex Elicitation of Continuous Properties

Jessica Finocchiaro ⋅ Rafael Frongillo

A property or statistic of a distribution is said to be elicitable if it can be expressed as the minimizer of some loss function in expectation. Recent work shows that continuous real-valued properties are elicitable if and only if they are identifiable, meaning the set of distributions with the same property value can be described by linear constraints. From a practical standpoint, one may ask for which such properties do there exist convex loss functions. In this paper, in a finite-outcome setting, we show that in fact every elicitable real-valued property can be elicited by a convex loss function. Our proof is constructive, and leads to convex loss functions for new properties.