Optimization lies at the heart of many machine learning algorithms and enjoys great interest in our community. Indeed, this intimate relation of optimization with ML is the key motivation for the OPT series of workshops.
Looking back over the past decade, a strong trend is apparent: the intersection of OPT and ML has grown to the point that cutting-edge advances in optimization now often arise from the ML community. The distinctive feature of optimization within ML is its departure from textbook approaches, in particular its focus on a different set of goals driven by "big data, nonconvexity, and high dimensions," where both theory and implementation are crucial.
We wish to use OPT 2020 as a platform to foster discussion, discovery, and dissemination of the state of the art in optimization as relevant to machine learning. And well beyond that: as a platform to identify new directions and challenges that will drive future research, and to continue building the joint OPT+ML research community.
Invited Speakers
Volkan Cevher (EPFL)
Michael Friedlander (UBC)
Donald Goldfarb (Columbia)
Andreas Krause (ETH, Zurich)
Suvrit Sra (MIT)
Rachel Ward (UT Austin)
Ashia Wilson (MSR)
Tong Zhang (HKUST)
Instructions
Please join us in gather.town for all breaks and poster sessions (Click "Open Link" on any break or poster session).
To see all submitted papers and posters, go to the "optml website" at the top of the page.
Use RocketChat or the Zoom link (top of page) to ask a speaker a direct question during the Live Q&A and Contributed Talks.
Fri 3:15 a.m. – 3:50 a.m.

Welcome event (gather.town)
(Social event/Break)
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town.
Quanquan Gu, Courtney Paquette, Mark Schmidt, Sebastian Stich, Martin Takac 
Fri 3:50 a.m. – 4:00 a.m.

Welcome remarks to Session 1
(Opening remarks)

Sebastian Stich 
Fri 4:00 a.m. – 4:20 a.m.

Invited speaker: The Convexity of Learning Infinite-width Deep Neural Networks, Tong Zhang
(Talk)
Deep learning has achieved considerable empirical success in recent years. Although deep neural networks (DNNs) are highly nonconvex with respect to the model parameters, it has been observed that training overparametrized DNNs leads to consistent solutions that are highly reproducible across different random initializations. I will explain this phenomenon by modeling DNNs using feature representations, and show that the optimization landscape is convex with respect to the features. Moreover, we show that optimization with respect to the nonconvex DNN parameters leads to a globally optimal solution under an idealized regularity condition, which can explain various empirical findings.
Tong Zhang 
Fri 4:20 a.m. – 4:30 a.m.

Live Q&A with Tong Zhang (Zoom)
(Q&A)

Sebastian Stich 
Fri 4:30 a.m. – 4:50 a.m.

Invited speaker: Adaptation and universality in first-order methods, Volkan Cevher
(Talk)
In this talk, we review some of the recent advances in first-order methods for convex and nonconvex optimization, as well as their universality properties. We say an algorithm is universal if it does not need to know whether the optimization objective is smooth. We first recall the AdaGrad method and show that AdaGrad is a universal algorithm without any modifications: it implicitly exploits the smoothness of the problem to achieve the standard O(1/k) rate for smooth objectives, where k is the iteration count. We then introduce an accelerated, universal variant of AdaGrad, dubbed AcceleGrad, which in addition obtains the optimal convergence rate of O(1/k^2) in the smooth setting with deterministic oracles. We also introduce UniXGrad, the first algorithm that simultaneously achieves optimal rates for smooth or nonsmooth problems with either deterministic or stochastic first-order oracles in the constrained convex setting. We conclude the presentation with results in nonconvex optimization revolving around ADAM-type algorithms, including new convergence rates.
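As context for the universality discussion, the diagonal AdaGrad update scales each coordinate by its accumulated squared gradients, with no smoothness parameter to set. A minimal sketch, not the speaker's code; the quadratic test objective and hyperparameters are made-up illustrations:

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, steps=300, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps shrink with the
    running sum of squared gradients seen so far."""
    x = np.asarray(x0, dtype=float)
    acc = np.zeros_like(x)  # accumulated squared gradients
    for _ in range(steps):
        g = grad(x)
        acc += g * g
        x -= lr * g / (np.sqrt(acc) + eps)
    return x

# Toy smooth objective f(x) = x1^2 + 10*x2^2, minimized at the origin.
x = adagrad(lambda x: np.array([2.0 * x[0], 20.0 * x[1]]), [1.0, 1.0])
```

The same code runs unchanged on nonsmooth objectives (with subgradients in place of gradients), which is the sense in which the method is universal.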
Volkan Cevher 
Fri 5:00 a.m. – 5:30 a.m.

Contributed talks in Session 1 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Laurent Condat, "Distributed Proximal Splitting Algorithms with Rates and Acceleration"
Zhize Li, "PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization"
Ohad Shamir, "Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?"
Tiffany Vlaar, "Constraint-Based Regularization of Neural Networks"
Mohammadi Zaki, "Employing No Regret Learners for Pure Exploration in Linear Bandits"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Sebastian Stich, Laurent Condat, Zhize Li, Ohad Shamir, Tiffany Vlaar, Mohammadi Zaki 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Constraint-Based Regularization of Neural Networks, Tiffany Vlaar
(Talk)
We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes, and stabilize deep neural networks, thus improving the robustness of training algorithms and the generalization capabilities of the trained network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in both the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. Our methods see performance improvements on image classification tasks.
Tiffany Vlaar 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?, Ohad Shamir
(Talk)
It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (where the gradient norm is less than $\epsilon$) in $O(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently \emph{not} smooth, making these results inapplicable. Moreover, as recently pointed out in \citet{zhang2020complexity}, it is generally impossible to provide finite-time guarantees for finding an $\epsilon$-stationary point of nonsmooth functions. Perhaps the most natural relaxation is to find points which are *near* such $\epsilon$-stationary points. In this paper, we show that even this relaxed goal is hard to attain in general, given only black-box access to the function values and gradients. We also discuss the pros and cons of alternative approaches.

Ohad Shamir 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Employing No Regret Learners for Pure Exploration in Linear Bandits, Mohammadi Zaki
(Talk)
We study the best arm identification problem in linear multi-armed bandits (LMAB) in the fixed-confidence ($\delta$-PAC) setting; this is also the problem of optimizing an unknown linear function over a discrete ground set with noisy, zeroth-order access. We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Most previous approaches rely on access to a minimax optimization oracle, which is at the heart of the complexity of the problem. We propose a method to solve this optimization problem (up to suitable accuracy) by interpreting it as a two-player zero-sum game and sequentially converging to its saddle point, using low-regret learners to compute the players' strategies in each round; this yields a concrete querying algorithm. The algorithm, which we call the {\em Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting, without requiring boundedness assumptions on the parameter space. PELEG is thus the first algorithm to achieve both order-optimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.

Mohammadi Zaki 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Distributed Proximal Splitting Algorithms with Rates and Acceleration, Laurent Condat
(Talk)
We propose new generic distributed proximal splitting algorithms, well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new non-ergodic rates, as well as new accelerated versions of the algorithms, using varying stepsizes.
Laurent Condat 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, Zhize Li
(Talk)
In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement, as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p$, and with probability $1-p$ reuses the previous gradient with a small adjustment, at a much lower computational cost. We give a simple formula for the optimal choice of $p$. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds in both the finite-sum and online regimes, which establish PAGE as an optimal method. Besides, we show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch; the results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.
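The update rule described in the abstract can be sketched on a least-squares toy problem as follows. This is a minimal illustration under assumed hyperparameters (step size, batch size, $p$), not the authors' implementation:

```python
import numpy as np

def page_minimize(A, b, steps=300, lr=0.1, p=0.5, batch=8, seed=0):
    """Minimize 0.5*mean((A x - b)^2) with a PAGE-style estimator.

    Each iteration: with probability p take a fresh minibatch gradient
    (the vanilla SGD update); with probability 1-p reuse the previous
    estimate plus a cheap minibatch correction g + grad(x_new) - grad(x_old)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape

    def mb_grad(x, idx):  # minibatch gradient of the least-squares loss
        return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

    x = np.zeros(d)
    g = A.T @ (A @ x - b) / n  # initialize with one full gradient
    for _ in range(steps):
        x_new = x - lr * g
        idx = rng.choice(n, size=batch, replace=False)
        if rng.random() < p:
            g = mb_grad(x_new, idx)                        # vanilla SGD branch
        else:
            g = g + mb_grad(x_new, idx) - mb_grad(x, idx)  # reuse + correction
        x = x_new
    return x
```

The reuse branch evaluates only one small minibatch at two points, which is where the lower per-iteration cost relative to a fresh large-batch gradient comes from.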

Zhize Li 
Fri 6:00 a.m. – 6:50 a.m.

Poster Session 1 (gather.town)
(Poster session)
Laurent Condat, Tiffany Vlaar, Ohad Shamir, Mohammadi Zaki, Zhize Li, GuanHorng Liu, Samuel Horváth, Mher Safaryan, Yoni Choukroun, kumar Shridhar, Nabil Kahale, Jikai Jin, Pratik Kumar Jawanpuria, Gaurav Kumar Yadav, Kazuki Koyama, Junyoung Kim, Xiao Li, Saugata Purkayastha, Adil Salim, Dighanchal Banerjee, Peter Richtarik, Lakshman Mahto, Tian Ye, Bamdev Mishra, Huikang Liu, JiaJie Zhu

Fri 6:50 a.m. – 7:00 a.m.

Welcome remarks to Session 2
(Opening remarks)

Martin Takac 
Fri 7:00 a.m. – 7:20 a.m.

Invited speaker: Adaptive Sampling for Stochastic Risk-Averse Learning, Andreas Krause
(Talk)
In high-stakes machine learning applications, it is crucial to perform well not only on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk-averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which allows it to scale to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and nonconvex learning tasks.
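For reference, the empirical CVaR objective being optimized is simply the average of the worst $\alpha$-fraction of losses. A minimal sketch of the quantity itself (not the speaker's algorithm):

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of losses."""
    sorted_desc = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(sorted_desc))))
    return sorted_desc[:k].mean()

# The worst half of the losses [1, 2, 3, 4] is [4, 3], so CVaR_0.5 = 3.5;
# at alpha = 1 the CVaR reduces to the ordinary mean.
```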
Andreas Krause 
Fri 7:20 a.m. – 7:30 a.m.

Live Q&A with Andreas Krause (Zoom)
(Q&A)

Martin Takac 
Fri 7:30 a.m. – 7:50 a.m.

Invited speaker: Practical Kronecker-factored BFGS and L-BFGS methods for training deep neural networks, Donald Goldfarb
(Talk)
In training deep neural network (DNN) models, computing and storing a full BFGS approximation, or storing a modest number of (step, change-in-gradient) vector pairs for use in an L-BFGS implementation, is impractical. In our methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices, analogous to the approach in KFAC for approximating the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded, both above and below. In tests on autoencoder feedforward and convolutional neural network models, our methods outperformed KFAC and were competitive with state-of-the-art first-order stochastic methods.
Donald Goldfarb 
Fri 8:00 a.m. – 8:30 a.m.

Contributed talks in Session 2 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Samuel Horvath, "Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization"
GuanHorng Liu, "DDPNOpt: Differential Dynamic Programming Neural Optimizer"
Nicolas Loizou, "Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence"
Sharan Vaswani, "Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)"
Sharan Vaswani, "How to make your optimizer generalize better"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Martin Takac, Samuel Horváth, GuanHorng Liu, Nicolas Loizou, Sharan Vaswani 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization, Samuel Horvath
(Talk)
Adaptivity is an important yet understudied property in modern optimization theory. The gap between the state-of-the-art theory and current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as stepsize schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners for selecting algorithms that work broadly without tweaking the hyperparameters. In this work, blending the ``geometrization'' technique introduced by \cite{lei2016less} and the \texttt{SARAH} algorithm of \cite{nguyen2017sarah}, we propose the Geometrized \texttt{SARAH} algorithm for nonconvex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-Łojasiewicz (PL) constant, if present. In addition, it achieves the best-available convergence rate for non-PL objectives while simultaneously outperforming existing algorithms for PL objectives.
Samuel Horváth 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence, Nicolas Loizou
(Talk)
We propose a stochastic variant of the classical Polyak stepsize \citep{polyak1987introduction} commonly used in the subgradient method. Although computing the Polyak stepsize requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak stepsize (SPS) is an attractive choice for setting the learning rate of stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex, and nonconvex functions. Furthermore, our analysis yields novel convergence guarantees for SGD with a constant stepsize. We show that SPS is particularly effective when training overparameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets, and demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training overparameterized models.
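In the interpolation setting the abstract highlights, each per-sample optimal value is zero and the SPS rule takes a particularly simple form. A minimal sketch on a consistent least-squares system (illustrative hyperparameters, not the authors' code):

```python
import numpy as np

def sgd_sps(A, b, steps=400, c=0.5, gamma_max=10.0, seed=0):
    """SGD with the stochastic Polyak stepsize on per-sample losses
    f_i(x) = 0.5*(a_i @ x - b_i)^2, assuming interpolation (f_i* = 0):
    stepsize = min(f_i(x) / (c * ||grad f_i(x)||^2), gamma_max)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        r = A[i] @ x - b[i]   # residual of sample i
        g = r * A[i]          # gradient of f_i at x
        gnorm2 = g @ g
        if gnorm2 > 0.0:
            gamma = min(0.5 * r * r / (c * gnorm2), gamma_max)
            x -= gamma * g
    return x
```

Note that no problem-dependent constant (e.g., a smoothness bound) appears anywhere: the stepsize is computed entirely from the sampled loss and gradient.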
Nicolas Loizou 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: DDPNOpt: Differential Dynamic Programming Neural Optimizer, GuanHorng Liu
(Talk)
The interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by first showing that most widely used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order method rooted in trajectory optimization. In this vein, we propose a new class of optimizers, the DDP Neural Optimizer (DDPNOpt), for training DNNs. DDPNOpt features layer-wise feedback policies which improve convergence and robustness. It outperforms other optimal-control-inspired training methods in both convergence and complexity, and is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
GuanHorng Liu 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search), Sharan Vaswani
(Talk)
Adaptive gradient methods are typically used for training overparameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant stepsize and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve $O(1)$ regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant-stepsize AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the stepsize and requires tuning. We alleviate this problem by using stochastic line-search (SLS) and Polyak's stepsizes (SPS) to help these methods adapt to the function's local smoothness. Using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant-stepsize counterparts. Experimentally, we show that these techniques help improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks.

Sharan Vaswani 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: How to make your optimizer generalize better, Sharan Vaswani
(Talk)
We study the implicit regularization of optimization methods for linear models interpolating the training data in the underparametrized and overparametrized regimes. For overparameterized linear regression, where there are infinitely many interpolating solutions, different optimization methods can converge to solutions with varying generalization performance. In this setting, we show that projections onto linear spans can be used to move between solutions. Furthermore, via a simple reparameterization, we can ensure that an arbitrary optimizer converges to the minimum $\ell_2$-norm solution with favourable generalization properties. For underparameterized linear classification, optimizers can converge to different decision boundaries separating the data. We prove that for any such classifier, there exists a family of quadratic norms $\|\cdot\|_P$ such that the classifier's direction is the same as that of the maximum $P$-margin solution. We argue that analyzing convergence to the standard maximum $\ell_2$-margin is arbitrary, and show that minimizing the norm induced by the data can result in better generalization. We validate our theoretical results via experiments on synthetic and real datasets.
Sharan Vaswani 
Fri 8:30 a.m. – 9:00 a.m.

Break (gather.town)
(Break)

Fri 9:00 a.m. – 9:20 a.m.

Invited speaker: SGD without replacement: optimal rate analysis and more, Suvrit Sra
(Talk)
Stochastic gradient descent (SGD) is the workhorse of machine learning. There are two fundamental versions of SGD: (i) those that pick stochastic gradients with replacement, and (ii) those that pick without replacement. Ironically, version (ii) is what is used in practice (across most ML toolkits), while version (i) is what almost all published work analyzes. This mismatch is well-known. It arises because analyzing SGD without replacement involves biased gradients and must cope with the lack of independence between the stochastic gradients used. In this talk, I will present recent progress on analyzing without-replacement SGD, the bulk of which will focus on minimax optimal convergence rates. The rates are obtained without assuming componentwise convexity. I will also mention further refinements of the results under this additional convexity assumption, which remove drawbacks common to previous works (such as the large number of epochs required).
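The two sampling schemes contrasted in the talk can be illustrated side by side. A minimal sketch, not the speaker's analysis; the least-squares test problem in the usage below is a made-up example:

```python
import numpy as np

def sgd_epoch_based(grad_i, n, x0, lr=0.05, epochs=30, replacement=False, seed=0):
    """Epoch-based SGD over n component functions.

    replacement=False: random reshuffling, i.e. each epoch visits every
    index exactly once in a fresh random order (what most ML toolkits do).
    replacement=True: indices drawn i.i.d., the variant usually analyzed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        order = rng.integers(n, size=n) if replacement else rng.permutation(n)
        for i in order:
            x -= lr * grad_i(i, x)
    return x
```

Conditioned on the earlier draws in an epoch, the without-replacement indices are not independent and the per-step gradient is biased for the full gradient, which is exactly what complicates the analysis.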
Suvrit Sra 
Fri 9:20 a.m. – 9:30 a.m.

Live Q&A with Suvrit Sra (Zoom)
(Q&A)

Martin Takac 
Fri 9:45 a.m. – 10:50 a.m.

Poster Session 2 (gather.town)
(Poster session)
Sharan Vaswani, Nicolas Loizou, Wenjie Li, Preetum Nakkiran, Zhan Gao, Sina Baghal, Jingfeng Wu, Roozbeh Yousefzadeh, Jinyi Wang, Jing Wang, Cong Xie, Anastasia Borovykh, Stanislaw Jastrzebski, Soham Dan, Yiliang Zhang, Mark Tuddenham, Sarath Pattathil, Ievgen Redko, Jeremy Cohen, Yasaman Esfandiari, Zhanhong Jiang, Mostafa ElAraby, Chulhee Yun, Michael Psenka, Robert Gower, Xiaoyu Wang

Fri 10:50 a.m. – 11:00 a.m.

Welcome remarks to Session 3
(Opening remarks)

Mark Schmidt 
Fri 11:00 a.m. – 11:20 a.m.

Invited speaker: Stochastic Geodesic Optimization, Ashia Wilson
(Talk)
Geodesic convexity offers a promising systematic way to handle nonconvexity for many problems of interest in statistics and computer science. The focus of this talk will be to describe efforts to extend the basic tools of convex optimization on Euclidean space to the general setting of Riemannian manifolds. We begin by motivating our focus on geodesic optimization with several examples, reviewing the basics of geodesic spaces and several techniques from optimization along the way. Particular attention will be given to optimization techniques which achieve oracle lower bounds for minimizing stochastic functions, namely accelerated methods. We end with a discussion of how one might adapt these techniques to the Riemannian setting. 
Ashia Wilson 
Fri 11:20 a.m. – 11:30 a.m.

Live Q&A with Ashia Wilson (Zoom)
(Q&A)

Mark Schmidt 
Fri 11:30 a.m. – 11:50 a.m.

Invited speaker: Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA, Rachel Ward
(Talk)
We present new non-asymptotic growth and concentration bounds for a product of independent random matrices, similar in spirit to the concentration bounds for sums of independent random matrices developed in the previous decade. Our matrix product concentration bounds provide a new, direct convergence proof of Oja's algorithm for streaming Principal Component Analysis, and should be useful more broadly for analyzing the convergence of stochastic gradient descent for certain classes of nonconvex optimization problems, including neural networks. This talk covers joint work with Amelia Henriksen, De Huang, Jon Niles-Weed, and Joel Tropp.
Rachel Ward 
Fri 11:50 a.m. – 12:00 p.m.

Live Q&A with Rachel Ward (Zoom)
(Q&A)

Mark Schmidt 
Fri 12:00 p.m. – 12:30 p.m.

Contributed talks in Session 3 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Zhan Gao, "Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate"
Wenjie Li, "Variance Reduction on Adaptive Stochastic Mirror Descent"
Preetum Nakkiran, "Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems"
Denny Wu, "When Does Preconditioning Help or Hurt Generalization?"
Chengrun Yang, "TenIPS: Inverse Propensity Sampling for Tensor Completion"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Mark Schmidt, Zhan Gao, Wenjie Li, Preetum Nakkiran, Denny Wu, Chengrun Yang 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Variance Reduction on Adaptive Stochastic Mirror Descent, Wenjie Li
(Talk)
We study the application of the variance reduction technique to general adaptive stochastic mirror descent algorithms in nonsmooth nonconvex optimization problems. We prove that variance reduction helps to reduce the gradient complexity of most general stochastic mirror descent algorithms, so it works well with time-varying step sizes and adaptive optimization algorithms such as AdaGrad. We check the validity of our claims using experiments in deep learning.
Wenjie Li 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems, Preetum Nakkiran
(Talk)
The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li et al. (2019) recently proved that this behavior can exist in a simplified nonconvex neural-network setting. In this work, we show that this phenomenon can exist even for convex learning problems, in particular, linear regression in two dimensions. We give a toy convex problem where learning rate annealing (a large initial learning rate followed by a small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes and early stopping.
Preetum Nakkiran 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: When Does Preconditioning Help or Hurt Generalization?, Denny Wu
(Talk)
While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view of how the \textit{implicit bias} of optimizers affects the comparison of generalization properties. We provide an exact bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the ``shape'' of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a well-specified model, or an aligned signal. Based on this analysis, we discuss approaches to manage the bias-variance tradeoff, and the benefit of interpolating between first- and second-order updates. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural networks, and observe robust trends matching our theoretical analysis.

Denny Wu 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate, Zhan Gao
(Talk)
Finite-sum minimization problems, i.e., those whose objective may be written as the sum over a collection of instantaneous costs, are ubiquitous in modern machine learning and data science. Efficient numerical techniques for their solution must trade off per-step complexity against the number of steps required for convergence. Incremental Quasi-Newton (IQN) methods achieve a favorable balance of these competing attributes in the sense that their complexity is independent of the sample size, while their convergence rate can be faster than linear. To date, however, this local superlinear behavior is known only asymptotically. In this work, we put forth a new variant of IQN, specifically of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) type, that incorporates a greedy basis-vector selection step and admits a non-asymptotic, explicit local superlinear rate. To the best of our knowledge, this is the first time an explicit superlinear rate has been given for Quasi-Newton methods in the incremental setting.
Zhan Gao 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: TenIPS: Inverse Propensity Sampling for Tensor Completion, Chengrun Yang
(Talk)
Tensors are widely used to model relationships among objects in a high-dimensional space. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry itself. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. Our algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a randomized linear algebra approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach.
Chengrun Yang 
Fri 12:30 p.m. – 1:30 p.m.

Break (gather.town)
(Break)

Fri 1:30 p.m. – 1:50 p.m.

Invited speaker: Fast convergence of stochastic subgradient method under interpolation, Michael Friedlander
(Talk)
This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to overparameterized empirical-risk optimization models that exactly fit the training data. We prove that for models with composite structures often found in neural networks, the interpolation condition implies that the model is effectively smooth at all minimizers, and therefore that SSGD converges at rates normally achievable only for smooth convex problems. We also prove that the fast rates we derive are optimal for any subgradient method applied to convex problems where interpolation holds. This is joint work with Huang Fang and Zhenan Fan.
Michael Friedlander 
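As a toy illustration of the interpolation condition (not the composite models or the rates analyzed in the talk), consider hinge-loss classification on separable data: every per-sample loss can be driven exactly to zero, and around such a point the sampled losses are locally smooth (identically zero), so the constant-step stochastic subgradient method eventually stops updating. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Separable data with margin: keep samples whose score under w_true is
# comfortably away from the decision boundary.
d = 20
w_true = rng.standard_normal(d)
X = rng.standard_normal((400, d))
scores = X @ w_true
keep = np.abs(scores) > 2.0
X, y = X[keep][:100], np.sign(scores[keep][:100])

# SSGD on f(w) = mean_i max(0, 1 - y_i * x_i @ w): interpolation holds,
# since w = w_true / 2 already gives every sample hinge loss exactly 0.
w = np.zeros(d)
step = 0.5
for t in range(20000):
    i = rng.integers(len(y))
    if 1 - y[i] * (X[i] @ w) > 0:   # active piece: subgradient is -y_i * x_i
        w += step * y[i] * X[i]     # inactive piece: subgradient 0, no move

loss = np.mean(np.maximum(0.0, 1 - y * (X @ w)))
print(f"final hinge loss: {loss:.6f}")
```

With a constant step size the training loss is driven to (near) zero, perceptron-style, rather than stalling at the O(step) floor one expects for generic nonsmooth problems without interpolation.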
Fri 1:30 p.m.  1:35 p.m.

Intro to Invited Speaker 8
(Organizer intro)

Mark Schmidt 
Fri 1:50 p.m.  2:00 p.m.

Live Q&A with Michael Friedlander (Zoom)
(Q&A)

Mark Schmidt 
Fri 2:00 p.m.  2:50 p.m.

Poster Session 3 (gather.town)
(Poster session)
»
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Denny Wu, Chengrun Yang, Tolga Ergen, Sanae Lotfi, Charles Guille-Escuret, Boris Ginsburg, Hanbaek Lyu, Cong Xie, David Newton, Debraj Basu, Yewen Wang, James Lucas, Maojia Li, Lijun Ding, Jose Javier Gonzalez Ortiz, Reyhane Askari Hemmat, Zhiqi Bu, Neal Lawton, Kiran Thekumparampil, Jiaming Liang, Lindon Roberts, Jingyi Zhu, Dongruo Zhou

Fri 2:50 p.m.  3:00 p.m.

Welcome remarks to Session 4
(Opening remarks)

Quanquan Gu 
Fri 3:00 p.m.  3:20 p.m.

Invited speaker: Online nonnegative matrix factorization for Markovian and other real data, Deanna Needell and Hanbaek Lyu
(Talk)
»
SlidesLive Video » Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a nonconvex generalization of the well-known OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts "network dictionary patches" from a given network in an online manner, encoding the main features of the network. We demonstrate this technique on real-world data and discuss recent extensions and variations. 
Hanbaek Lyu, Deanna Needell 
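The online factorization loop can be sketched in a few lines. This is a hedged illustration of generic online NMF on an i.i.d. synthetic stream, following the sufficient-statistics scheme of Mairal et al., not the Markovian setting or the Network Dictionary Learning framework of the talk; `nnls_pg` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)

d, r, T = 30, 5, 500
W_true = rng.random((d, r))            # nonnegative ground-truth dictionary

def nnls_pg(W, x, iters=50):
    # Nonnegative code for one sample by projected gradient on ||x - W h||^2.
    h = np.zeros(W.shape[1])
    lr = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-12)
    for _ in range(iters):
        h = np.maximum(0.0, h - lr * (W.T @ (W @ h - x)))
    return h

# Online NMF: stream samples one at a time and aggregate the sufficient
# statistics A = sum h h^T and B = sum x h^T, then refresh the dictionary.
W = np.maximum(rng.random((d, r)), 1e-3)
A = np.zeros((r, r))
B = np.zeros((d, r))
for t in range(T):
    x = W_true @ rng.random(r)         # one streamed data vector
    h = nnls_pg(W, x)
    A += np.outer(h, h)
    B += np.outer(x, h)
    # Block-coordinate dictionary update with nonnegativity projection.
    for j in range(r):
        if A[j, j] > 1e-12:
            W[:, j] = np.maximum(0.0, W[:, j] + (B[:, j] - W @ A[:, j]) / A[j, j])

x = W_true @ rng.random(r)
h = nnls_pg(W, x)
err = np.linalg.norm(x - W @ h) / np.linalg.norm(x)
print(f"relative reconstruction error on a fresh sample: {err:.3f}")
```

The talk's result concerns exactly this kind of loop when the streamed `x` comes from a Markov chain rather than an i.i.d. source.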
Fri 3:20 p.m.  3:30 p.m.

Live Q&A with Deanna Needell and Hanbaek Lyu (Zoom)
(Q&A)

Quanquan Gu 
Fri 3:30 p.m.  4:00 p.m.

Contributed talks in Session 4 (Zoom)
(Multiple talks)
»
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Tolga Ergen, "Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time"; Charles Guille-Escuret, "A Study of Condition Numbers for First-Order Optimization"; Lewis Liu, "Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets"; Sanae Lotfi, "Stochastic Damped L-BFGS with controlled norm of the Hessian approximation"; Dongruo Zhou, "On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization". You can find a video on the NeurIPS website where the speakers discuss their papers in detail. 
Quanquan Gu, Sanae Lotfi, Charles Guille-Escuret, Tolga Ergen, Dongruo Zhou 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Stochastic Damped L-BFGS with controlled norm of the Hessian approximation, Sanae Lotfi
(Talk)
»
link »
SlidesLive Video » We propose a new stochastic variance-reduced damped L-BFGS algorithm, in which we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS. We establish almost-sure convergence to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGS-VR and SVRG on a modified DavidNet problem, a highly nonconvex and ill-conditioned problem that arises in the context of deep learning, while their performance is comparable on a logistic regression problem and a nonconvex support-vector machine problem. 
Sanae Lotfi 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: A Study of Condition Numbers for First-Order Optimization, Charles Guille-Escuret
(Talk)
»
link »
SlidesLive Video » In this work we introduce a new framework for the theoretical study of the convergence and tuning of first-order optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called the *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics, which we prove to be continuous. We describe their mutual relations and provide the guaranteed convergence rates of the Gradient Descent algorithm when tuned accordingly. 
Charles Guille-Escuret 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets, Lewis Liu
(Talk)
»
link »
SlidesLive Video »
When the constraint set $\mathcal{C}$ is strongly convex, the Frank-Wolfe algorithm, which is affine covariant, enjoys accelerated convergence rates. In contrast, existing results rely on norm-dependent assumptions, usually incurring non-affine-invariant bounds. In this work, we introduce new structural assumptions on the problem and derive an affine-invariant, norm-independent analysis of Frank-Wolfe. Based on our analysis, we propose an affine-invariant backtracking line-search. Interestingly, we show that typical backtracking line-searches using smoothness of the objective function converge to an affine-invariant step size, despite using affine-dependent norms in the computation of the step size.
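For intuition, here is a hedged sketch of vanilla Frank-Wolfe with exact line search on the simplest strongly convex set, the unit Euclidean ball, where the linear minimization oracle (LMO) is closed-form. It is not the paper's affine-invariant analysis or backtracking scheme, just the base algorithm it studies.

```python
import numpy as np

rng = np.random.default_rng(3)

# Frank-Wolfe on a strongly convex set: minimize f(x) = ||x - b||^2 over the
# unit Euclidean ball, whose LMO is closed-form:
# argmin_{||s|| <= 1} <g, s> = -g / ||g||.
d = 10
b = rng.standard_normal(d)
b *= 2.0 / np.linalg.norm(b)           # put the optimum on the boundary

x = np.zeros(d)
gaps = []
for t in range(200):
    g = 2 * (x - b)                    # gradient of f at x
    s = -g / np.linalg.norm(g)         # LMO over the unit ball
    gaps.append(g @ (x - s))           # Frank-Wolfe duality gap
    if gaps[-1] < 1e-12:               # converged to numerical precision
        break
    dvec = s - x
    # Exact line search for a quadratic objective along x + gamma * dvec.
    gamma = np.clip(-(g @ dvec) / (2 * dvec @ dvec), 0.0, 1.0)
    x = x + gamma * dvec

print(f"final duality gap: {gaps[-1]:.2e}")
```

On strongly convex sets with the optimum on the boundary, the duality gap decays linearly, which is the accelerated regime the abstract refers to.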


Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time, Tolga Ergen
(Talk)
»
link »
SlidesLive Video »
We study the training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with polynomial complexity with respect to the number of data samples, the number of neurons, and the data dimension. In particular, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for two-layer CNNs, where the convex problems are regularized by the sum of $\ell_2$ norms of the variables.

Tolga Ergen 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, Dongruo Zhou
(Talk)
»
link »
SlidesLive Video » Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods, including AMSGrad, RMSProp, and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods converge in expectation to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than that of stochastic gradient descent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition, we prove high-probability bounds on the convergence rates of AMSGrad, RMSProp, and AdaGrad, which had not been established before. Our analyses shed light on the mechanism behind adaptive gradient methods in optimizing nonconvex objectives. 
Dongruo Zhou 
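For reference, AMSGrad (one of the methods analyzed above) differs from Adam only in keeping a running maximum of the second-moment estimate. A minimal sketch on a smooth nonconvex toy function; the test function and hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def amsgrad(grad, x0, lr=0.05, b1=0.9, b2=0.99, iters=3000):
    # AMSGrad: Adam with a running *maximum* of the second-moment estimate,
    # which makes the per-coordinate effective step size non-increasing.
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # first-moment estimate
    v = np.zeros_like(x)       # second-moment estimate
    v_hat = np.zeros_like(x)   # elementwise running max of v
    for _ in range(iters):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        v_hat = np.maximum(v_hat, v)
        x = x - lr * m / (np.sqrt(v_hat) + 1e-8)
    return x

# Smooth nonconvex test function f(x) = sum_i (x_i^2 - cos(2 x_i)),
# with gradient 2x + 2 sin(2x); its only stationary point is x = 0.
grad = lambda x: 2 * x + 2 * np.sin(2 * x)
x = amsgrad(grad, np.full(5, 2.0))
print(f"gradient norm at the returned point: {np.linalg.norm(grad(x)):.2e}")
```

The run ends with a near-zero gradient norm, matching the kind of first-order stationarity guarantee the abstract describes.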
Fri 4:00 p.m.  4:05 p.m.

Closing remarks

Quanquan Gu, Courtney Paquette, Mark Schmidt, Sebastian Stich, Martin Takac 
Author Information
Courtney Paquette (Lehigh University)
Mark Schmidt (University of British Columbia)
Sebastian Stich (EPFL)
Dr. [Sebastian U. Stich](https://sstich.ch/) is a postdoctoral researcher in machine learning at EPFL (Lausanne, Switzerland). Research interests: methods for machine learning and statistics, at the interface of theory and practice; collaborative learning (distributed, federated, and decentralized methods); optimization for machine learning (adaptive stochastic methods and generalization performance).
Quanquan Gu (UCLA)
Martin Takac (Lehigh University)
More from the Same Authors

2021 Workshop: OPT 2021: Optimization for Machine Learning »
Courtney Paquette · Quanquan Gu · Oliver Hinder · Katya Scheinberg · Sebastian Stich · Martin Takac 
2020 Poster: A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks »
Zixiang Chen · Yuan Cao · Quanquan Gu · Tong Zhang 
2020 Poster: Regret Bounds without Lipschitz Continuity: Online Learning with Relative-Lipschitz Losses »
Yihan Zhou · Victor Sanches Portella · Mark Schmidt · Nicholas Harvey 
2020 Poster: Ensemble Distillation for Robust Model Fusion in Federated Learning »
Tao Lin · Lingjing Kong · Sebastian Stich · Martin Jaggi 
2020 Poster: Agnostic Learning of a Single Neuron with Gradient Descent »
Spencer Frei · Yuan Cao · Quanquan Gu 
2020 Poster: A Finite-Time Analysis of Two Time-Scale Actor-Critic Methods »
Yue Frank Wu · Weitong ZHANG · Pan Xu · Quanquan Gu 
2019 Poster: Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks »
Spencer Frei · Yuan Cao · Quanquan Gu 
2019 Poster: Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks »
Difan Zou · Ziniu Hu · Yewen Wang · Song Jiang · Yizhou Sun · Quanquan Gu 
2019 Poster: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates »
Sharan Vaswani · Aaron Mishkin · Issam Laradji · Mark Schmidt · Gauthier Gidel · Simon LacosteJulien 
2019 Poster: Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction »
Difan Zou · Pan Xu · Quanquan Gu 
2019 Poster: Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Poster: An Improved Analysis of Training Overparameterized Deep Neural Networks »
Difan Zou · Quanquan Gu 
2019 Poster: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Spotlight: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2018 Poster: Third-order Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima »
Yaodong Yu · Pan Xu · Quanquan Gu 
2018 Poster: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient »
Aaron Mishkin · Frederik Kunstner · Didrik Nielsen · Mark Schmidt · Mohammad Emtiyaz Khan 
2018 Spotlight: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization »
Robert Gower · Filip Hanzely · Peter Richtarik · Sebastian Stich 
2018 Poster: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Reinforcement Learning for Solving the Vehicle Routing Problem »
MohammadReza Nazari · Afshin Oroojlooy · Lawrence Snyder · Martin Takac 
2018 Spotlight: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Sparsified SGD with Memory »
Sebastian Stich · Jean-Baptiste Cordonnier · Martin Jaggi 
2018 Poster: Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization »
Bargav Jayaraman · Lingxiao Wang · David Evans · Quanquan Gu 
2017 Poster: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Spotlight: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Poster: Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization »
Pan Xu · Jian Ma · Quanquan Gu 
2016 Poster: Semiparametric Differential Graph Models »
Pan Xu · Quanquan Gu 
2016 Poster: A Multi-Batch L-BFGS Method for Machine Learning »
Albert Berahas · Jorge Nocedal · Martin Takac 
2015 Poster: High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality »
Zhaoran Wang · Quanquan Gu · Yang Ning · Han Liu 
2015 Poster: Stop Wasting My Gradients: Practical SVRG »
Reza Babanezhad Harikandeh · Mohamed Osama Ahmed · Alim Virani · Mark Schmidt · Jakub Konečný · Scott Sallinen 
2014 Poster: Communication-Efficient Distributed Dual Coordinate Ascent »
Martin Jaggi · Virginia Smith · Martin Takac · Jonathan Terhorst · Sanjay Krishnan · Thomas Hofmann · Michael Jordan 
2014 Poster: Sparse PCA with Oracle Property »
Quanquan Gu · Zhaoran Wang · Han Liu 
2014 Poster: Robust Tensor Decomposition with Gross Corruption »
Quanquan Gu · Huan Gui · Jiawei Han 
2012 Poster: Selective Labeling via Error Bound Minimization »
Quanquan Gu · Tong Zhang · Chris Ding · Jiawei Han