Orals & Spotlights Track 30: Optimization/Theory
Each Oral includes Q&ASpotlights have joint Q&As
Time
2020-12-10T06:00:00-08:00 - 2020-12-10T09:00:00-08:00Session chairs
Yuxin Chen, Francis BachVideo
Chat
To ask questions please use rocketchat, available only upon registration and login.
Schedule
2020-12-10T06:00:00-08:00 - 2020-12-10T06:15:00-08:00
1 - Oral: Black-Box Ripper: Copying black-box models using generative evolutionary algorithms
Antonio Barbalau, Adrian Cosma, Radu Tudor Ionescu, Marius Popescu
We study the task of replicating the functionality of black-box neural models, for which we only know the output class probabilities provided for a set of input images. We assume back-propagation through the black-box model is not possible and its training images are not available, e.g. the model could be exposed only through an API. In this context, we present a teacher-student framework that can distill the black-box (teacher) model into a student model with minimal accuracy loss. To generate useful data samples for training the student, our framework (i) learns to generate images on a proxy data set (with images and classes different from those used to train the black-box) and (ii) applies an evolutionary strategy to make sure that each generated data sample exhibits a high response for a specific class when given as input to the black box. Our framework is compared with several baseline and state-of-the-art methods on three benchmark data sets. The empirical evidence indicates that our model is superior to the considered baselines. Although our method does not back-propagate through the black-box network, it generally surpasses state-of-the-art methods that regard the teacher as a glass-box model. Our code is available at: https://github.com/antoniobarbalau/black-box-ripper.
2020-12-10T06:15:00-08:00 - 2020-12-10T06:30:00-08:00
2 - Oral: Towards a Better Global Loss Landscape of GANs
Ruoyu Sun, Tiantian Fang, Alex Schwing
Understanding of GAN training is still very limited. One major challenge is its non-convex-non-concave min-max objective, which may lead to sub-optimal local minima. In this work, we perform a global landscape analysis of the empirical loss of GANs. We prove that a class of separable-GAN, including the original JS-GAN, has exponentially many bad basins which are perceived as mode-collapse. We also study the relativistic pairing GAN (RpGAN) loss which couples the generated samples and the true samples. We prove that RpGAN has no bad basins. Experiments on synthetic data show that the predicted bad basin can indeed appear in training. We also perform experiments to support our theory that RpGAN has a better landscape than separable-GAN. For instance, we empirically show that RpGAN performs better than separable-GAN with relatively narrow neural nets. The code is available at \url{https://github.com/AilsaF/RS-GAN}.
2020-12-10T06:30:00-08:00 - 2020-12-10T06:45:00-08:00
3 - Oral: Online Sinkhorn: Optimal Transport distances from sample streams
Arthur Mensch, Gabriel Peyré
Optimal Transport (OT) distances are now routinely used as loss functions in ML tasks. Yet, computing OT distances between arbitrary (i.e. not necessarily discrete) probability distributions remains an open problem. This paper introduces a new online estimator of entropy-regularized OT distances between two such arbitrary distributions. It uses streams of samples from both distributions to iteratively enrich a non-parametric representation of the transportation plan. Compared to the classic Sinkhorn algorithm, our method leverages new samples at each iteration, which enables a consistent estimation of the true regularized OT distance. We provide a theoretical analysis of the convergence of the online Sinkhorn algorithm, showing a nearly-1/n asymptotic sample complexity for the iterate sequence. We validate our method on synthetic 1-d to 10-d data and on real 3-d shape data.
2020-12-10T06:45:00-08:00 - 2020-12-10T07:00:00-08:00
Break
2020-12-10T07:00:00-08:00 - 2020-12-10T07:10:00-08:00
5 - Spotlight: Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses
Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, Kunal Talwar
Uniform stability is a notion of algorithmic stability that bounds the worst case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. [2016] provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in understanding of the generalization properties of SGD and several applications to differentially private convex optimization for smooth losses. Our work is the first to address uniform stability of SGD on nonsmooth convex losses. Specifically, we provide sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses. Our lower bounds show that, in the nonsmooth case, (S)GD can be inherently less stable than in the smooth case. On the other hand, our upper bounds show that (S)GD is sufficiently stable for deriving new and useful bounds on generalization error. Most notably, we obtain the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case. In addition, our bound allow us to derive a new algorithm for differentially private nonsmooth stochastic convex optimization with optimal excess population risk. Our algorithm is simpler and more efficient than the best known algorithm for the nonsmooth case, due to Feldman et al. [2020].
2020-12-10T07:10:00-08:00 - 2020-12-10T07:20:00-08:00
6 - Spotlight: Optimal Approximation - Smoothness Tradeoffs for Soft-Max Functions
Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, Emmanouil Zampetakis
A soft-max function has two main efficiency measures: (1) approximation - which corresponds to how well it approximates the maximum function, (2) smoothness - which shows how sensitive it is to changes of its input. Our goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. This leads to novel soft-max functions, each of which is optimal for a different application. The most commonly used soft-max function, called exponential mechanism, has optimal tradeoff between approximation measured in terms of expected additive approximation and smoothness measured with respect to Renyi Divergence. We introduce a soft-max function, called piece-wise linear soft-max, with optimal tradeoff between approximation, measured in terms of worst-case additive approximation and smoothness, measured with respect to l_q-norm. The worst-case approximation guarantee of the piece-wise linear mechanism enforces sparsity in the output of our soft-max function, a property that is known to be important in Machine Learning applications Martins et al. 16, Laha et al. 18 and is not satisfied by the exponential mechanism. Moreover, the l_q-smoothness is suitable for applications in Mechanism Design and Game Theory where the piece-wise linear mechanism outperforms the exponential mechanism. Finally, we investigate another soft-max function, called power mechanism, with optimal tradeoff between expected multiplicative approximation and smoothness with respect to the Renyi Divergence, which provides improved theoretical and practical results in differentially private submodular optimization.
2020-12-10T07:20:00-08:00 - 2020-12-10T07:30:00-08:00
7 - Spotlight: Conformal Symplectic and Relativistic Optimization
Guilherme Franca, Jeremias Sulam, Daniel Robinson, Rene Vidal
Arguably, the two most popular accelerated or momentum-based optimization methods are Nesterov's accelerated gradient and Polyaks's heavy ball, both corresponding to different discretizations of a particular second order differential equation with a friction term. Such connections with continuous-time dynamical systems have been instrumental in demystifying acceleration phenomena in optimization. Here we study structure-preserving discretizations for a certain class of dissipative (conformal) Hamiltonian systems, allowing us to analyze the symplectic structure of both Nesterov and heavy ball, besides providing several new insights into these methods. Moreover, we propose a new algorithm based on a dissipative relativistic system that normalizes the momentum and may result in more stable/faster optimization. Importantly, such a method generalizes both Nesterov and heavy ball, each being recovered as distinct limiting cases, and has potential advantages at no additional cost.
2020-12-10T07:30:00-08:00 - 2020-12-10T07:40:00-08:00
8 - Spotlight: Random Reshuffling is Not Always Better
Christopher De Sa
Many learning algorithms, such as stochastic gradient descent, are affected by the order in which training examples are used. It is often observed that sampling the training examples without-replacement, also known as random reshuffling, causes learning algorithms to converge faster. We give a counterexample to the Operator Inequality of Noncommutative Arithmetic and Geometric Means, a longstanding conjecture that relates to the performance of random reshuffling in learning algorithms (Recht and Ré, "Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences," COLT 2012). We use this to give an example of a learning task and algorithm for which with-replacement random sampling actually outperforms random reshuffling.
2020-12-10T07:40:00-08:00 - 2020-12-10T07:50:00-08:00
Q&A: Joint Q&A for Preceeding Spotlights
2020-12-10T07:50:00-08:00 - 2020-12-10T08:00:00-08:00
10 - Spotlight: The Statistical Complexity of Early-Stopped Mirror Descent
Tomas Vaskevicius, Varun Kanade, Patrick Rebeschini
Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. By completing an inequality that characterizes convexity for the squared loss, we identify an intrinsic link between offset Rademacher complexities and potential-based convergence analysis of mirror descent methods. Our observation immediately yields excess risk guarantees for the path traced by the iterates of mirror descent in terms of offset complexities of certain function classes depending only on the choice of the mirror map, initialization point, step-size, and the number of iterations. We apply our theory to recover, in a rather clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.
2020-12-10T08:00:00-08:00 - 2020-12-10T08:10:00-08:00
11 - Spotlight: Overfitting Can Be Harmless for Basis Pursuit, But Only to a Degree
Peizhong Ju, Xiaojun Lin, Jia Liu
Recently, there have been significant interests in studying the so-called "double-descent" of the generalization error of linear regression models under the overparameterized and overfitting regime, with the hope that such analysis may provide the first step towards understanding why overparameterized deep neural networks (DNN) still generalize well. However, to date most of these studies focused on the min L2-norm solution that overfits the data. In contrast, in this paper we study the overfitting solution that minimizes the L1-norm, which is known as Basis Pursuit (BP) in the compressed sensing literature. Under a sparse true linear regression model with p i.i.d. Gaussian features, we show that for a large range of p up to a limit that grows exponentially with the number of samples n, with high probability the model error of BP is upper bounded by a value that decreases with p. To the best of our knowledge, this is the first analytical result in the literature establishing the double-descent of overfitting BP for finite n and p. Further, our results reveal significant differences between the double-descent of BP and min L2-norm solutions. Specifically, the double-descent upper-bound of BP is independent of the signal strength, and for high SNR and sparse models the descent-floor of BP can be much lower and wider than that of min L2-norm solutions.
2020-12-10T08:10:00-08:00 - 2020-12-10T08:20:00-08:00
12 - Spotlight: Towards Problem-dependent Optimal Learning Rates
Yunbei Xu, Assaf Zeevi
We study problem-dependent rates, i.e., generalization errors that scale tightly with the variance or the effective loss at the "best hypothesis." Existing uniform convergence and localization frameworks, the most widely used tools to study this problem, often fail to simultaneously provide parameter localization and optimal dependence on the sample size. As a result, existing problem-dependent rates are often rather weak when the hypothesis class is "rich" and the worst-case bound of the loss is large. In this paper we propose a new framework based on a "uniform localized convergence" principle. We provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general "rich" classes; we also establish improved loss-dependent rate for standard empirical risk minimization.
2020-12-10T08:20:00-08:00 - 2020-12-10T08:30:00-08:00
13 - Spotlight: On Uniform Convergence and Low-Norm Interpolation Learning
Lijia Zhou, D.J. Sutherland, Nati Srebro
We consider an underdetermined noisy linear regression model where the minimum-norm interpolating predictor is known to be consistent, and ask: can uniform convergence in a norm ball, or at least (following Nagarajan and Kolter) the subset of a norm ball that the algorithm selects on a typical input set, explain this success? We show that uniformly bounding the difference between empirical and population errors cannot show any learning in the norm ball, and cannot show consistency for any set, even one depending on the exact algorithm and distribution. But we argue we can explain the consistency of the minimal-norm interpolator with a slightly weaker, yet standard, notion: uniform convergence of zero-error predictors in a norm ball. We use this to bound the generalization error of low- (but not minimal-)norm interpolating predictors.
2020-12-10T08:30:00-08:00 - 2020-12-10T08:40:00-08:00
Q&A: Joint Q&A for Preceeding Spotlights
2020-12-10T08:40:00-08:00 - 2020-12-10T09:00:00-08:00