Timezone: »

Workshop
OPT2020: Optimization for Machine Learning
Courtney Paquette · Mark Schmidt · Sebastian Stich · Quanquan Gu · Martin Takac

Fri Dec 11 03:15 AM -- 04:30 PM (PST) @

Optimization lies at the heart of many machine learning algorithms and enjoys great interest in our community. Indeed, this intimate relation of optimization with ML is the key motivation for the OPT series of workshops.

Looking back over the past decade, a strong trend is apparent: The intersection of OPT and ML has grown to the point that now cutting-edge advances in optimization often arise from the ML community. The distinctive feature of optimization within ML is its departure from textbook approaches, in particular, its focus on a different set of goals driven by "big-data, nonconvexity, and high-dimensions," where both theory and implementation are crucial.

We wish to use OPT 2020 as a platform to foster discussion, discovery, and dissemination of the state-of-the-art in optimization as relevant to machine learning. And well beyond that: as a platform to identify new directions and challenges that will drive future research, and continue to build the OPT+ML joint research community.

Invited Speakers
Volkan Cevher (EPFL)
Michael Friedlander (UBC)
Donald Goldfarb (Columbia)
Andreas Krause (ETH, Zurich)
Suvrit Sra (MIT)
Rachel Ward (UT Austin)
Ashia Wilson (MSR)
Tong Zhang (HKUST)

Instructions

To see all submitted paper and posters, go to the "opt-ml website" at the top of the page.

Use RocketChat or Zoom link (top of page) if you want to ask the speaker a direct question during the Live Q&A and Contributed Talks.

 Fri 3:15 a.m. - 3:50 a.m. Welcome event (gather.town) (Social event/Break)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 🔗 Fri 3:50 a.m. - 4:00 a.m. Welcome remarks to Session 1 (Opening remarks) Sebastian Stich 🔗 Fri 4:00 a.m. - 4:20 a.m. Invited speaker: The Convexity of Learning Infinite-width Deep Neural Networks, Tong Zhang (Talk)    Deep learning has received considerable empirical successes in recent years. Although deep neural networks (DNNs) are highly nonconvex with respect to the model parameters, it has been observed that the training of overparametrized DNNs leads to consistent solutions that are highly reproducible with different random initializations. I will explain this phenomenon by modeling DNNs using feature representations, and show that the optimization landscape is convex with respect to the features. Moreover, we show that optimization with respect to the nonconvex DNN parameters leads to a global optimal solution under an idealized regularity condition, which can explain various empirical findings. Tong Zhang 🔗 Fri 4:20 a.m. - 4:30 a.m. Live Q&A with Tong Zhang (Zoom) (Q&A) Sebastian Stich 🔗 Fri 4:30 a.m. - 4:50 a.m. Invited speaker: Adaptation and universality in first-order methods, Volkan Cevher (Talk) In this talk, we review some of the recent advances in first-order methods for convex and non-convex optimization as well as their universality properties. We say an algorithm is universal if it does not require to know whether the optimization objective is smooth or not. We first recall the AdaGrad method and show that AdaGrad is a universal algorithm without any modifications: It implicitly exploits the smoothness of the problem to achieve the standard O(1/k) rate in the presence of smooth objectives, where k is the iteration count. To this end, we introduce an accelerated, universal variant of AdaGrad, dubbed as AcceleGrad, that in addition obtains the optimal convergence rate of O(1/k^2) in the smooth setting with deterministic oracles. We then introduce UniXGrad, which is the first algorithm that simultaneously achieves optimal rates for smooth or non-smooth problems with either deterministic or stochastic first-order oracles in the constrained convex setting. We conclude the presentation with results in non-convex optimization revolving around ADAM-type algorithms, including new convergence rates. Volkan Cevher 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed talks in Session 1 (Zoom) (Multiple talks)  link » Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Laurent Condat, "Distributed Proximal Splitting Algorithms with Rates and Acceleration" Zhize Li, "PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization" Ohad Shamir, "Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?" Tiffany Vlaar, "Constraint-Based Regularization of Neural Networks" Mohammadi Zaki, "Employing No Regret Learners for Pure Exploration in Linear Bandits" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. Link » Sebastian Stich · Laurent Condat · Zhize Li · Ohad Shamir · Tiffany Vlaar · Mohammadi Zaki 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed Video: Constraint-Based Regularization of Neural Networks, Tiffany Vlaar (Talk)  link »    We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. Our methods see performance improvements on image classification tasks. Link » Tiffany Vlaar 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed Video: Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?, Ohad Shamir (Talk)  link » It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (where the gradient norm is less than $\epsilon$) in $O(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently \emph{not} smooth, making these results inapplicable. Moreover, as recently pointed out in \citet{zhang2020complexity}, it is generally impossible to provide finite-time guarantees for finding an $\epsilon$-stationary point of nonsmooth functions. Perhaps the most natural relaxation of this is to find points which are *near* such $\epsilon$-stationary points. In this paper, we show that even this relaxed goal is hard to obtain in general, given only black-box access to the function values and gradients. We also discuss the pros and cons of alternative approaches. Link » Ohad Shamir 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed Video: Employing No Regret Learners for Pure Exploration in Linear Bandits, Mohammadi Zaki (Talk)  link »    We study the best arm identification problem in linear multi-armed bandits (LMAB) in the fixed confidence ($\delta$-PAC) setting; this is also the problem of optimizing an unknown linear function over a discrete ground set with noisy, zeroth-order access. We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Most previous approaches rely on access to a minimax optimization oracle which is at the heart of the complexity of the problem. We propose a method to solve this optimization problem (upto suitable accuracy) by interpreting the problem as a two-player zero-sum game, and attempting to sequentially converge to its saddle point using low-regret learners to compute the players' strategies in each round which yields a concrete querying algorithm. The algorithm, which we call the {\em Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting without requiring boundedness assumptions on the parameter space. PELEG is, thus, the first algorithm to achieve both order-optimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees. Link » Mohammadi Zaki 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed Video: Distributed Proximal Splitting Algorithms with Rates and Acceleration, Laurent Condat (Talk)  link »    We propose new generic distributed proximal splitting algorithms, well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new nonergodic rates, as well as new accelerated versions of the algorithms, using varying stepsizes. Link » Laurent Condat 🔗 Fri 5:00 a.m. - 5:30 a.m. Contributed Video: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, Zhize Li (Talk)  link »    In this paper, we propose a novel stochastic gradient estimator---ProbAbilistic Gradient Estimator (PAGE)---for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p$ and reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p$. We give a simple formula for the optimal choice of $p$. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds both in the finite-sum and online regimes, which establish that PAGE is an optimal method. Besides, we show that for nonconvex functions satisfying the Polyak-\L ojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, and the results demonstrate that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE. Link » Zhize Li 🔗 Fri 6:00 a.m. - 6:50 a.m. Poster Session 1 (gather.town) (Poster session)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » Laurent Condat · Tiffany Vlaar · Ohad Shamir · Mohammadi Zaki · Zhize Li · Guan-Horng Liu · Samuel Horváth · Mher Safaryan · Yoni Choukroun · kumar Shridhar · Nabil Kahale · Jikai Jin · Pratik Kumar Jawanpuria · Gaurav Kumar Yadav · Kazuki Koyama · Junyoung Kim · Xiao Li · Saugata Purkayastha · Adil Salim · Dighanchal Banerjee · Peter Richtarik · Lakshman Mahto · Tian Ye · Bamdev Mishra · Huikang Liu · Jia-Jie Zhu 🔗 Fri 6:50 a.m. - 7:00 a.m. Welcome remarks to Session 2 (Opening remarks) Martin Takac 🔗 Fri 7:00 a.m. - 7:20 a.m. Invited speaker: Adaptive Sampling for Stochastic Risk-Averse Learning, Andreas Krause (Talk)    In high-stakes machine learning applications, it is crucial to not only perform well on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which allows scaling it to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks. Andreas Krause 🔗 Fri 7:20 a.m. - 7:30 a.m. Live Q&A with Andreas Krause (Zoom) (Q&A) Martin Takac 🔗 Fri 7:30 a.m. - 7:50 a.m. Invited speaker: Practical Kronecker-factored BFGS and L-BFGS methods for training deep neural networks, Donald Goldfarb (Talk)    In training deep neural network (DNN) models, computing and storing a full  BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is impractical. In our methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices, analogous to the approach in KFAC for approximating the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded, both above and below. In tests on autoencoder feed forward and convolutional neural network models, our methods outperformed KFAC and were competitive with state-of-the-art first-order stochastic methods. Donald Goldfarb 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed talks in Session 2 (Zoom) (Multiple talks) Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Samuel Horvath, "Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization" Guan-Horng Liu, "DDPNOpt: Differential Dynamic Programming Neural Optimizer" Nicolas Loizou, "Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence" Sharan Vaswani, "Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)" Sharan Vaswani, "How to make your optimizer generalize better" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. Martin Takac · Samuel Horváth · Guan-Horng Liu · Nicolas Loizou · Sharan Vaswani 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed Video: Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization, Samuel Horvath (Talk)  link »    Adaptivity is an important yet under-studied property in modern optimization theory. The gap between the state-of-the-art theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as step-size schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the geometrization'' technique introduced by \cite{lei2016less} and the \texttt{SARAH} algorithm of \cite{nguyen2017sarah}, we propose the Geometrized \texttt{SARAH} algorithm for non-convex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-\L{}ojasiewicz (PL) constant, if present. In addition, it achieves the best-available convergence rate for non-PL objectives simultaneously while outperforming existing algorithms for PL objectives. Link » Samuel Horváth 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed Video: Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence, Nicolas Loizou (Talk)  link »    We propose a stochastic variant of the classical Polyak step-size \citep{polyak1987introduction} commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models. Link » Nicolas Loizou 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed Video: DDPNOpt: Differential Dynamic Programming Neural Optimizer, Guan-Horng Liu (Talk)  link »    Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by first showing that most widely-used algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated second-order method rooted in trajectory optimization. In this vein, we propose a new class of optimizer, DDP Neural Optimizer (DDPNOpt), for training DNNs. DDPNOpt features layer-wise feedback policies which improve convergence and robustness. It outperforms other optimal-control inspired training methods in both convergence and complexity, and is competitive against state-of-the-art first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory. Link » Guan-Horng Liu 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed Video: Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search), Sharan Vaswani (Talk)  link »    Adaptive gradient methods are typically used for training over-parameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant step-size and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve an $O(1)$ regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant step-size AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple, convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the step-size and requires tuning. We alleviate this problem by using stochastic line-search (SLS) and Polyak's step-sizes (SPS) to help these methods adapt to the function's local smoothness. By using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant step-size counterparts. Experimentally, we show that these techniques help improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks. Link » Sharan Vaswani 🔗 Fri 8:00 a.m. - 8:30 a.m. Contributed Video: How to make your optimizer generalize better, Sharan Vaswani (Talk)  link »    We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. For over-parameterized linear regression, where there are infinitely many interpolating solutions, different optimization methods can converge to solutions with varying generalization performance. In this setting, we show that projections onto linear spans can be used to move between solutions. Furthermore, via a simple reparameterization, we can ensure that an arbitrary optimizer converges to the minimum l2-norm solution with favourable generalization properties. For under-parameterized linear classification, optimizers can converge to different decision boundaries separating the data. We prove that for any such classifier, there exists a family of quadratic norms ||.||_P such that the classifier's direction is the same as that of the maximum P-margin solution. We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data can result in better generalization. We validate our theoretical results via experiments on synthetic and real datasets. Link » Sharan Vaswani 🔗 Fri 8:30 a.m. - 9:00 a.m. Break (gather.town) (Break)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » 🔗 Fri 9:00 a.m. - 9:20 a.m. Invited speaker: SGD without replacement: optimal rate analysis and more, Suvrit Sra (Talk)    Stochastic gradient descent (SGD) is the workhorse of machine learning. There are two fundamental versions of SGD: (i) those that pick stochastic gradients with replacement, and (ii) those that pick without replacement. Ironically, version (ii) is what is used in practice (across most ML toolkits), while version (i) is what almost all published work analyzes. This mismatch is well-known. It arises because analyzing SGD without replacement involves biased gradients and must cope with lack of independence between the stochastic gradients used. In this talk, I will present recent progress on analyzing without replacement SGD, the bulk of which will focus on minimax optimal convergence rates. The rates are obtained without assuming componentwise convexity. I will mention further refinements of the results assuming this additional convexity, which remove drawbacks common to previous works (such as large number of epochs required) Suvrit Sra 🔗 Fri 9:20 a.m. - 9:30 a.m. Live Q&A with Suvrit Sra (Zoom) (Q&A) Martin Takac 🔗 Fri 9:45 a.m. - 10:50 a.m. Poster Session 2 (gather.town) (Poster session)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » Sharan Vaswani · Nicolas Loizou · Wenjie Li · Preetum Nakkiran · Zhan Gao · Sina Baghal · Jingfeng Wu · Roozbeh Yousefzadeh · Jinyi Wang · Jing Wang · Cong Xie · Anastasia Borovykh · Stanislaw Jastrzebski · Soham Dan · Yiliang Zhang · Mark Tuddenham · Sarath Pattathil · Ievgen Redko · Jeremy Cohen · Yasaman Esfandiari · Zhanhong Jiang · Mostafa ElAraby · Chulhee Yun · Michael Psenka · Robert Gower · Xiaoyu Wang 🔗 Fri 10:50 a.m. - 11:00 a.m. Welcome remarks to Session 3 (Opening remarks) Mark Schmidt 🔗 Fri 11:00 a.m. - 11:20 a.m. Invited speaker: Stochastic Geodesic Optimization, Ashia Wilson (Talk) Geodesic convexity offers a promising systematic way to handle non-convexity for many problems of interest in statistics and computer science. The focus of this talk will be to describe efforts to extend the basic tools of convex optimization on Euclidean space to the general setting of Riemannian manifolds. We begin by motivating our focus on geodesic optimization with several examples, reviewing the basics of geodesic spaces and several techniques from optimization along the way. Particular attention will be given to optimization techniques which achieve oracle lower bounds for minimizing stochastic functions, namely accelerated methods. We end with a discussion of how one might adapt these techniques to the Riemannian setting. Ashia Wilson 🔗 Fri 11:20 a.m. - 11:30 a.m. Live Q&A with Ashia Wilson (Zoom) (Q&A) Mark Schmidt 🔗 Fri 11:30 a.m. - 11:50 a.m. Invited speaker: Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA, Rachel Ward (Talk) We present new nonasymptotic growth and concentration bounds for a product of independent random matrices, similar in spirit to concentration for sums of independent random matrices developed in the previous decade. Our matrix product concentration bounds provide a new, direct convergence proof of Oja's algorithm for streaming Principal Component Analysis, and should be useful more broadly for analyzing the convergence of stochastic gradient descent for certain classes of nonconvex optimization problems, including neural networks. This talk covers joint work with Amelia Henriksen, De Huang, Jon Niles-Weed, and Joel Tropp. Rachel Ward 🔗 Fri 11:50 a.m. - 12:00 p.m. Live Q&A with Rachel Ward (Zoom) (Q&A) Mark Schmidt 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed talks in Session 3 (Zoom) (Multiple talks) Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Zhan Gao, "Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate" Wenjie Li, "Variance Reduction on Adaptive Stochastic Mirror Descent" Preetum Nakkiran, "Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems" Denny Wu, "When Does Preconditioning Help or Hurt Generalization?" Chengrun Yang, "TenIPS: Inverse Propensity Sampling for Tensor Completion" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. Mark Schmidt · Zhan Gao · Wenjie Li · Preetum Nakkiran · Denny Wu · Chengrun Yang 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed Video: Variance Reduction on Adaptive Stochastic Mirror Descent, Wenjie Li (Talk)  link »    We study the application of the variance reduction technique on general adaptive stochastic mirror descent algorithms in nonsmooth nonconvex optimization problems. We prove that variance reduction helps to reduce the gradient complexity of most general stochastic mirror descent algorithms, so it works well with time-varying steps sizes and adaptive optimization algorithms such as AdaGrad. We check the validity of our claims using experiments in deep learning. Link » Wenjie Li 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed Video: Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems, Preetum Nakkiran (Talk)  link »    Learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li et al. (2019) recently proved this behavior can exist in a simplified non-convex neural-network setting. In this work, we show that this phenomenon can exist even for convex learning problems -- in particular, linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (large initial learning rate, followed by small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes, and early-stopping. Link » Preetum Nakkiran 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed Video: When Does Preconditioning Help or Hurt Generalization?, Denny Wu (Talk)  link »    While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the \textit{implicit bias} of optimizers affects the comparison of generalization properties. We provide an exact bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on label noise and shape'' of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a well-specified model, or aligned signal. Based on this analysis, we discuss approaches to manage the bias-variance tradeoff, and the benefit of interpolating between first- and second-order updates. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network, and observe robust trends matching our theoretical analysis. Link » Denny Wu 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed Video: Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate, Zhan Gao (Talk)  link »    Finite-sum minimization, i.e., problems where the objective may be written as the sum over a collection of instantaneous costs, are ubiquitous in modern machine learning and data science. Efficient numerical techniques for their solution must trade off per-step complexity with the number of steps required for convergence. Incremental Quasi-Newton methods (IQN) achieve a favorable balance of these competing attributes in the sense that their complexity is independent of the sample size, while their convergence rate can be faster than linear. This local superlinear behavior, to date, however, is known only asymptotically. In this work, we put forth a new variant of IQN, specifically of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) type, that incorporates a greedy basis vector selection step, and admits a non-asymptotic explicit local superlinear rate. To the best of our knowledge, this is the first time an explicit superlinear rate has been given for Quasi-Newton methods in the incremental setting. Link » Zhan Gao 🔗 Fri 12:00 p.m. - 12:30 p.m. Contributed Video: TenIPS: Inverse Propensity Sampling for Tensor Completion, Chengrun Yang (Talk)  link »    Tensors are widely used to model relationships among objects in a high-dimensional space. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. The algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a randomized linear algebra approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach. Link » Chengrun Yang 🔗 Fri 12:30 p.m. - 1:30 p.m. Break (gather.town) (Break)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » 🔗 Fri 1:30 p.m. - 1:50 p.m. Invited speaker: Fast convergence of stochastic subgradient method under interpolation, Michael Friedlander (Talk)    This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to over-parameterized empirical-risk optimization models that exactly fit the training data. We prove that for models with composite structures often found in neural networks, the interpolation condition implies that the model is effectively smooth at all minimizers, and therefore that SSGD converges at rates normally achievable only for smooth convex problems. We also prove that the fast rates we derive are optimal for any subgradient method applied to convex problems where interpolation holds. This is joint work with Huang Fang and Zhenan Fan. Michael Friedlander 🔗 Fri 1:30 p.m. - 1:35 p.m. Intro to Invited Speaker 8 (Organizer intro) Mark Schmidt 🔗 Fri 1:50 p.m. - 2:00 p.m. Live Q&A with Michael Friedlander (Zoom) (Q&A) Mark Schmidt 🔗 Fri 2:00 p.m. - 2:50 p.m. Poster Session 3 (gather.town) (Poster session)  link » Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. Link » Denny Wu · Chengrun Yang · Tolga Ergen · sanae lotfi · Charles Guille-Escuret · Boris Ginsburg · Hanbake Lyu · Cong Xie · David Newton · Debraj Basu · Yewen Wang · James Lucas · MAOJIA LI · Lijun Ding · Jose Javier Gonzalez Ortiz · Reyhane Askari Hemmat · Zhiqi Bu · Neal Lawton · Kiran Thekumparampil · Jiaming Liang · Lindon Roberts · Jingyi Zhu · Dongruo Zhou 🔗 Fri 2:50 p.m. - 3:00 p.m. Welcome remarks to Session 4 (Opening remarks) Quanquan Gu 🔗 Fri 3:00 p.m. - 3:20 p.m. Invited speaker: Online nonnegative matrix factorization for Markovian and other real data, Deanna Needell and Hanbaek Lyu (Talk)    Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a non-convex generalization of the well-known OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts `network dictionary patches' from a given network in an online manner that encodes main features of the network. We demonstrate this technique on real-world data and discuss recent extensions and variations. Hanbake Lyu · Deanna Needell 🔗 Fri 3:20 p.m. - 3:30 p.m. Live Q&A with Deanna Needell and Hanbake Lyu (Zoom) (Q&A) Quanquan Gu 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed talks in Session 4 (Zoom) (Multiple talks) Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Tolga Ergen, "Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time" Charles Guille-Escuret, "A Study of Condition Numbers for First-Order Optimization" Lewis Liu, "Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets" Sanae Lotfi, "Stochastic Damped L-BFGS with controlled norm of the Hessian approximation" Dongruo Zhou, "On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. Quanquan Gu · sanae lotfi · Charles Guille-Escuret · Tolga Ergen · Dongruo Zhou 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed Video: Stochastic Damped L-BFGS with controlled norm of the Hessian approximation, Sanae Lotfi (Talk)  link »    We propose a new stochastic variance-reduced damped L-BFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS. We establish almost sure converges to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGS-VR and SVRG on a modified DavidNet problem -- a highly nonconvex and ill-conditioned problem that arises in the context of deep learning, and their performance is comparable on a logistic regression problem and a nonconvex support-vector machine problem. Link » sanae lotfi 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed Video: A Study of Condition Numbers for First-Order Optimization, Charles Guille-Escuret (Talk)  link »    In this work we introduce a new framework for the theoretical study of convergence and tuning of first-order optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy.Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. Link » Charles Guille-Escuret 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed Video: Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets, Lewis Liu (Talk)  link »    When the constraint set $\mathcal{C}$ is strongly convex, the Frank-Wolfe algorithm, which is affine co-variant, enjoys accelerated convergence rates. In contrast, existing results rely on norm-dependent assumptions, usually incurring non-affine invariant bounds. In this work, we introduce new structural assumptions on the problem and derive an affine invariant, norm-independent analysis of Frank Wolfe. Based on our analysis, we propose an affine-invariant backtracking line-search. Interestingly, we show that typical backtracking line-searches using smoothness of the objective function surprisingly converge to an affine-invariant stepsize, despite using affine-dependent norms in the computation of the stepsize. Link » 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed Video: Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time, Tolga Ergen (Talk)  link »    We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons and data dimension. Particularly, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for two-layer CNNs, where convex problems are regularized by the sum of $\ell_2$ norms of variables. Link » Tolga Ergen 🔗 Fri 3:30 p.m. - 4:00 p.m. Contributed Video: On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, Dongruo Zhou (Talk)  link »    Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than stochastic gradient decent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in nonconvex setting. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives. Link » Dongruo Zhou 🔗 Fri 4:00 p.m. - 4:05 p.m. Closing remarks Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 🔗

#### Author Information

##### Sebastian Stich (EPFL)

Dr. [Sebastian U. Stich](https://sstich.ch/) is a faculty at the CISPA Helmholtz Center for Information Security. Research interests: - *methods for machine learning and statistics*—at the interface of theory and practice - *collaborative learning* (distributed, federated and decentralized methods) - *optimization for machine learning* (adaptive stochastic methods and generalization performance)