Timezone: »
Optimization lies at the heart of many machine learning algorithms and enjoys great interest in our community. Indeed, this intimate relation of optimization with ML is the key motivation for the OPT series of workshops.
Looking back over the past decade, a strong trend is apparent: The intersection of OPT and ML has grown to the point that now cuttingedge advances in optimization often arise from the ML community. The distinctive feature of optimization within ML is its departure from textbook approaches, in particular, its focus on a different set of goals driven by "bigdata, nonconvexity, and highdimensions," where both theory and implementation are crucial.
We wish to use OPT 2020 as a platform to foster discussion, discovery, and dissemination of the stateoftheart in optimization as relevant to machine learning. And well beyond that: as a platform to identify new directions and challenges that will drive future research, and continue to build the OPT+ML joint research community.
Invited Speakers
Volkan Cevher (EPFL)
Michael Friedlander (UBC)
Donald Goldfarb (Columbia)
Andreas Krause (ETH, Zurich)
Suvrit Sra (MIT)
Rachel Ward (UT Austin)
Ashia Wilson (MSR)
Tong Zhang (HKUST)
Instructions
Please join us in gather.town for all breaks and poster sessions (Click "Open Link" on any break or poster session).
To see all submitted paper and posters, go to the "optml website" at the top of the page.
Use RocketChat or Zoom link (top of page) if you want to ask the speaker a direct question during the Live Q&A and Contributed Talks.
Fri 3:15 a.m.  3:50 a.m.

Welcome event (gather.town)
(Social event/Break)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 🔗 
Fri 3:50 a.m.  4:00 a.m.

Welcome remarks to Session 1
(Opening remarks)

Sebastian Stich 🔗 
Fri 4:00 a.m.  4:20 a.m.

Invited speaker: The Convexity of Learning Infinitewidth Deep Neural Networks, Tong Zhang
(Talk)
SlidesLive Video » Deep learning has received considerable empirical successes in recent years. Although deep neural networks (DNNs) are highly nonconvex with respect to the model parameters, it has been observed that the training of overparametrized DNNs leads to consistent solutions that are highly reproducible with different random initializations. I will explain this phenomenon by modeling DNNs using feature representations, and show that the optimization landscape is convex with respect to the features. Moreover, we show that optimization with respect to the nonconvex DNN parameters leads to a global optimal solution under an idealized regularity condition, which can explain various empirical findings. 
Tong Zhang 🔗 
Fri 4:20 a.m.  4:30 a.m.

Live Q&A with Tong Zhang (Zoom)
(Q&A)

Sebastian Stich 🔗 
Fri 4:30 a.m.  4:50 a.m.

Invited speaker: Adaptation and universality in firstorder methods, Volkan Cevher
(Talk)
In this talk, we review some of the recent advances in firstorder methods for convex and nonconvex optimization as well as their universality properties. We say an algorithm is universal if it does not require to know whether the optimization objective is smooth or not. We first recall the AdaGrad method and show that AdaGrad is a universal algorithm without any modifications: It implicitly exploits the smoothness of the problem to achieve the standard O(1/k) rate in the presence of smooth objectives, where k is the iteration count. To this end, we introduce an accelerated, universal variant of AdaGrad, dubbed as AcceleGrad, that in addition obtains the optimal convergence rate of O(1/k^2) in the smooth setting with deterministic oracles. We then introduce UniXGrad, which is the first algorithm that simultaneously achieves optimal rates for smooth or nonsmooth problems with either deterministic or stochastic firstorder oracles in the constrained convex setting. We conclude the presentation with results in nonconvex optimization revolving around ADAMtype algorithms, including new convergence rates. 
Volkan Cevher 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed talks in Session 1 (Zoom)
(Multiple talks)
link »
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Laurent Condat, "Distributed Proximal Splitting Algorithms with Rates and Acceleration" Zhize Li, "PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization" Ohad Shamir, "Can We Find NearApproximatelyStationary Points of Nonsmooth Nonconvex Functions?" Tiffany Vlaar, "ConstraintBased Regularization of Neural Networks" Mohammadi Zaki, "Employing No Regret Learners for Pure Exploration in Linear Bandits" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. 
Sebastian Stich · Laurent Condat · Zhize Li · Ohad Shamir · Tiffany Vlaar · Mohammadi Zaki 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed Video: ConstraintBased Regularization of Neural Networks, Tiffany Vlaar
(Talk)
link »
SlidesLive Video » We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. Our methods see performance improvements on image classification tasks. 
Tiffany Vlaar 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed Video: Can We Find NearApproximatelyStationary Points of Nonsmooth Nonconvex Functions?, Ohad Shamir
(Talk)
link »
It is wellknown that given a bounded, smooth nonconvex function, standard gradientbased methods can find $\epsilon$stationary points (where the gradient norm is less than $\epsilon$) in $O(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently \emph{not} smooth, making these results inapplicable. Moreover, as recently pointed out in \citet{zhang2020complexity}, it is generally impossible to provide finitetime guarantees for finding an $\epsilon$stationary point of nonsmooth functions. Perhaps the most natural relaxation of this is to find points which are *near* such $\epsilon$stationary points. In this paper, we show that even this relaxed goal is hard to obtain in general, given only blackbox access to the function values and gradients. We also discuss the pros and cons of alternative approaches.

Ohad Shamir 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed Video: Employing No Regret Learners for Pure Exploration in Linear Bandits, Mohammadi Zaki
(Talk)
link »
SlidesLive Video »
We study the best arm identification problem in linear multiarmed bandits (LMAB) in the fixed confidence ($\delta$PAC) setting; this is also the problem of optimizing an unknown linear function over a discrete ground set with noisy, zerothorder access. We propose an explicitly implementable and provably orderoptimal samplecomplexity algorithm to solve this problem. Most previous approaches rely on access to a minimax optimization oracle which is at the heart of the complexity of the problem. We propose a method to solve this optimization problem (upto suitable accuracy) by interpreting the problem as a twoplayer zerosum game, and attempting to sequentially converge to its saddle point using lowregret learners to compute the players' strategies in each round which yields a concrete querying algorithm. The algorithm, which we call the {\em Phased Elimination Linear Exploration Game} (PELEG), maintains a highprobability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. We analyze the sample complexity of PELEG and show that it matches, up to order, an instancedependent lower bound on sample complexity in the linear bandit setting without requiring boundedness assumptions on the parameter space. PELEG is, thus, the first algorithm to achieve both orderoptimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.

Mohammadi Zaki 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed Video: Distributed Proximal Splitting Algorithms with Rates and Acceleration, Laurent Condat
(Talk)
link »
SlidesLive Video » We propose new generic distributed proximal splitting algorithms, well suited for largescale convex nonsmooth optimization. We derive sublinear and linear convergence results with new nonergodic rates, as well as new accelerated versions of the algorithms, using varying stepsizes. 
Laurent Condat 🔗 
Fri 5:00 a.m.  5:30 a.m.

Contributed Video: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, Zhize Li
(Talk)
link »
SlidesLive Video »
In this paper, we propose a novel stochastic gradient estimatorProbAbilistic Gradient Estimator (PAGE)for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p$ and reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1p$. We give a simple formula for the optimal choice of $p$. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds both in the finitesum and online regimes, which establish that PAGE is an optimal method. Besides, we show that for nonconvex functions satisfying the Polyak\L ojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, and the results demonstrate that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.

Zhize Li 🔗 
Fri 6:00 a.m.  6:50 a.m.

Poster Session 1 (gather.town)
(Poster session)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Laurent Condat · Tiffany Vlaar · Ohad Shamir · Mohammadi Zaki · Zhize Li · GuanHorng Liu · Samuel Horváth · Mher Safaryan · Yoni Choukroun · kumar Shridhar · Nabil Kahale · Jikai Jin · Pratik Kumar Jawanpuria · Gaurav Kumar Yadav · Kazuki Koyama · Junyoung Kim · Xiao Li · Saugata Purkayastha · Adil Salim · Dighanchal Banerjee · Peter Richtarik · Lakshman Mahto · Tian Ye · Bamdev Mishra · Huikang Liu · JiaJie Zhu

Fri 6:50 a.m.  7:00 a.m.

Welcome remarks to Session 2
(Opening remarks)

Martin Takac 🔗 
Fri 7:00 a.m.  7:20 a.m.

Invited speaker: Adaptive Sampling for Stochastic RiskAverse Learning, Andreas Krause
(Talk)
SlidesLive Video » In highstakes machine learning applications, it is crucial to not only perform well on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional ValueatRisk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zerosum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which allows scaling it to large data sets. Finally, we empirically demonstrate its effectiveness on largescale convex and nonconvex learning tasks. 
Andreas Krause 🔗 
Fri 7:20 a.m.  7:30 a.m.

Live Q&A with Andreas Krause (Zoom)
(Q&A)

Martin Takac 🔗 
Fri 7:30 a.m.  7:50 a.m.

Invited speaker: Practical Kroneckerfactored BFGS and LBFGS methods for training deep neural networks, Donald Goldfarb
(Talk)
SlidesLive Video » In training deep neural network (DNN) models, computing and storing a full BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an LBFGS implementation is impractical. In our methods, we approximate the Hessian by a blockdiagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices, analogous to the approach in KFAC for approximating the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and LBFGS approximations bounded, both above and below. In tests on autoencoder feed forward and convolutional neural network models, our methods outperformed KFAC and were competitive with stateoftheart firstorder stochastic methods. 
Donald Goldfarb 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed talks in Session 2 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Samuel Horvath, "Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization" GuanHorng Liu, "DDPNOpt: Differential Dynamic Programming Neural Optimizer" Nicolas Loizou, "Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence" Sharan Vaswani, "Adaptive Gradient Methods Converge Faster with OverParameterization (and you can do a linesearch)" Sharan Vaswani, "How to make your optimizer generalize better" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. 
Martin Takac · Samuel Horváth · GuanHorng Liu · Nicolas Loizou · Sharan Vaswani 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed Video: Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization, Samuel Horvath
(Talk)
link »
SlidesLive Video » Adaptivity is an important yet understudied property in modern optimization theory. The gap between the stateoftheart theory and the current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as stepsize schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners to select algorithms that work broadly without tweaking the hyperparameters. In this work, blending the ``geometrization'' technique introduced by \cite{lei2016less} and the \texttt{SARAH} algorithm of \cite{nguyen2017sarah}, we propose the Geometrized \texttt{SARAH} algorithm for nonconvex finitesum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak\L{}ojasiewicz (PL) constant, if present. In addition, it achieves the bestavailable convergence rate for nonPL objectives simultaneously while outperforming existing algorithms for PL objectives. 
Samuel Horváth 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed Video: Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence, Nicolas Loizou
(Talk)
link »
SlidesLive Video » We propose a stochastic variant of the classical Polyak stepsize \citep{polyak1987introduction} commonly used in the subgradient method. Although computing the Polyak stepsize requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak stepsize (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and nonconvex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant stepsize. We show that SPS is particularly effective when training overparameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problemdependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to stateoftheart optimization methods when training overparameterized models. 
Nicolas Loizou 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed Video: DDPNOpt: Differential Dynamic Programming Neural Optimizer, GuanHorng Liu
(Talk)
link »
SlidesLive Video » Interpretation of Deep Neural Networks (DNNs) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by first showing that most widelyused algorithms for training DNNs can be linked to the Differential Dynamic Programming (DDP), a celebrated secondorder method rooted in trajectory optimization. In this vein, we propose a new class of optimizer, DDP Neural Optimizer (DDPNOpt), for training DNNs. DDPNOpt features layerwise feedback policies which improve convergence and robustness. It outperforms other optimalcontrol inspired training methods in both convergence and complexity, and is competitive against stateoftheart first and second order methods. Our work opens up new avenues for principled algorithmic design built upon the optimal control theory. 
GuanHorng Liu 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed Video: Adaptive Gradient Methods Converge Faster with OverParameterization (and you can do a linesearch), Sharan Vaswani
(Talk)
link »
SlidesLive Video »
Adaptive gradient methods are typically used for training overparameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant stepsize and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve an $O(1)$ regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant stepsize AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple, convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the stepsize and requires tuning. We alleviate this problem by using stochastic linesearch (SLS) and Polyak's stepsizes (SPS) to help these methods adapt to the function's local smoothness. By using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problemdependent constants and retain the convergence guarantees of their constant stepsize counterparts. Experimentally, we show that these techniques help improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks.

Sharan Vaswani 🔗 
Fri 8:00 a.m.  8:30 a.m.

Contributed Video: How to make your optimizer generalize better, Sharan Vaswani
(Talk)
link »
SlidesLive Video » We study the implicit regularization of optimization methods for linear models interpolating the training data in the underparametrized and overparametrized regimes. For overparameterized linear regression, where there are infinitely many interpolating solutions, different optimization methods can converge to solutions with varying generalization performance. In this setting, we show that projections onto linear spans can be used to move between solutions. Furthermore, via a simple reparameterization, we can ensure that an arbitrary optimizer converges to the minimum l2norm solution with favourable generalization properties. For underparameterized linear classification, optimizers can converge to different decision boundaries separating the data. We prove that for any such classifier, there exists a family of quadratic norms ._P such that the classifier's direction is the same as that of the maximum Pmargin solution. We argue that analyzing convergence to the standard maximum l2margin is arbitrary and show that minimizing the norm induced by the data can result in better generalization. We validate our theoretical results via experiments on synthetic and real datasets. 
Sharan Vaswani 🔗 
Fri 8:30 a.m.  9:00 a.m.

Break (gather.town)
(Break)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
🔗 
Fri 9:00 a.m.  9:20 a.m.

Invited speaker: SGD without replacement: optimal rate analysis and more, Suvrit Sra
(Talk)
SlidesLive Video » Stochastic gradient descent (SGD) is the workhorse of machine learning. There are two fundamental versions of SGD: (i) those that pick stochastic gradients with replacement, and (ii) those that pick without replacement. Ironically, version (ii) is what is used in practice (across most ML toolkits), while version (i) is what almost all published work analyzes. This mismatch is wellknown. It arises because analyzing SGD without replacement involves biased gradients and must cope with lack of independence between the stochastic gradients used. In this talk, I will present recent progress on analyzing without replacement SGD, the bulk of which will focus on minimax optimal convergence rates. The rates are obtained without assuming componentwise convexity. I will mention further refinements of the results assuming this additional convexity, which remove drawbacks common to previous works (such as large number of epochs required) 
Suvrit Sra 🔗 
Fri 9:20 a.m.  9:30 a.m.

Live Q&A with Suvrit Sra (Zoom)
(Q&A)

Martin Takac 🔗 
Fri 9:45 a.m.  10:50 a.m.

Poster Session 2 (gather.town)
(Poster session)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Sharan Vaswani · Nicolas Loizou · Wenjie Li · Preetum Nakkiran · Zhan Gao · Sina Baghal · Jingfeng Wu · Roozbeh Yousefzadeh · Jinyi Wang · Jing Wang · Cong Xie · Anastasia Borovykh · Stanislaw Jastrzebski · Soham Dan · Yiliang Zhang · Mark Tuddenham · Sarath Pattathil · Ievgen Redko · Jeremy Cohen · Yasaman Esfandiari · Zhanhong Jiang · Mostafa ElAraby · Chulhee Yun · Michael Psenka · Robert Gower · Xiaoyu Wang

Fri 10:50 a.m.  11:00 a.m.

Welcome remarks to Session 3
(Opening remarks)

Mark Schmidt 🔗 
Fri 11:00 a.m.  11:20 a.m.

Invited speaker: Stochastic Geodesic Optimization, Ashia Wilson
(Talk)
Geodesic convexity offers a promising systematic way to handle nonconvexity for many problems of interest in statistics and computer science. The focus of this talk will be to describe efforts to extend the basic tools of convex optimization on Euclidean space to the general setting of Riemannian manifolds. We begin by motivating our focus on geodesic optimization with several examples, reviewing the basics of geodesic spaces and several techniques from optimization along the way. Particular attention will be given to optimization techniques which achieve oracle lower bounds for minimizing stochastic functions, namely accelerated methods. We end with a discussion of how one might adapt these techniques to the Riemannian setting. 
Ashia Wilson 🔗 
Fri 11:20 a.m.  11:30 a.m.

Live Q&A with Ashia Wilson (Zoom)
(Q&A)

Mark Schmidt 🔗 
Fri 11:30 a.m.  11:50 a.m.

Invited speaker: Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA, Rachel Ward
(Talk)
We present new nonasymptotic growth and concentration bounds for a product of independent random matrices, similar in spirit to concentration for sums of independent random matrices developed in the previous decade. Our matrix product concentration bounds provide a new, direct convergence proof of Oja's algorithm for streaming Principal Component Analysis, and should be useful more broadly for analyzing the convergence of stochastic gradient descent for certain classes of nonconvex optimization problems, including neural networks. This talk covers joint work with Amelia Henriksen, De Huang, Jon NilesWeed, and Joel Tropp. 
Rachel Ward 🔗 
Fri 11:50 a.m.  12:00 p.m.

Live Q&A with Rachel Ward (Zoom)
(Q&A)

Mark Schmidt 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed talks in Session 3 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Zhan Gao, "Incremental Greedy BFGS: An Incremental QuasiNewton Method with Explicit Superlinear Rate" Wenjie Li, "Variance Reduction on Adaptive Stochastic Mirror Descent" Preetum Nakkiran, "Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems" Denny Wu, "When Does Preconditioning Help or Hurt Generalization?" Chengrun Yang, "TenIPS: Inverse Propensity Sampling for Tensor Completion" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. 
Mark Schmidt · Zhan Gao · Wenjie Li · Preetum Nakkiran · Denny Wu · Chengrun Yang 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed Video: Variance Reduction on Adaptive Stochastic Mirror Descent, Wenjie Li
(Talk)
link »
SlidesLive Video » We study the application of the variance reduction technique on general adaptive stochastic mirror descent algorithms in nonsmooth nonconvex optimization problems. We prove that variance reduction helps to reduce the gradient complexity of most general stochastic mirror descent algorithms, so it works well with timevarying steps sizes and adaptive optimization algorithms such as AdaGrad. We check the validity of our claims using experiments in deep learning. 
Wenjie Li 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed Video: Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems, Preetum Nakkiran
(Talk)
link »
SlidesLive Video » Learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li et al. (2019) recently proved this behavior can exist in a simplified nonconvex neuralnetwork setting. In this work, we show that this phenomenon can exist even for convex learning problems  in particular, linear regression in 2 dimensions. We give a toy convex problem where learning rate annealing (large initial learning rate, followed by small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes, and earlystopping. 
Preetum Nakkiran 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed Video: When Does Preconditioning Help or Hurt Generalization?, Denny Wu
(Talk)
link »
SlidesLive Video »
While second order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view on how the \textit{implicit bias} of optimizers affects the comparison of generalization properties. We provide an exact biasvariance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioner $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and variance, and find that the relative generalization performance of different optimizers depends on label noise and ``shape'' of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a wellspecified model, or aligned signal. Based on this analysis, we discuss approaches to manage the biasvariance tradeoff, and the benefit of interpolating between first and secondorder updates. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first and secondorder optimizers in neural network, and observe robust trends matching our theoretical analysis.

Denny Wu 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed Video: Incremental Greedy BFGS: An Incremental QuasiNewton Method with Explicit Superlinear Rate, Zhan Gao
(Talk)
link »
SlidesLive Video » Finitesum minimization, i.e., problems where the objective may be written as the sum over a collection of instantaneous costs, are ubiquitous in modern machine learning and data science. Efficient numerical techniques for their solution must trade off perstep complexity with the number of steps required for convergence. Incremental QuasiNewton methods (IQN) achieve a favorable balance of these competing attributes in the sense that their complexity is independent of the sample size, while their convergence rate can be faster than linear. This local superlinear behavior, to date, however, is known only asymptotically. In this work, we put forth a new variant of IQN, specifically of the BroydenFletcherGoldfarbShanno (BFGS) type, that incorporates a greedy basis vector selection step, and admits a nonasymptotic explicit local superlinear rate. To the best of our knowledge, this is the first time an explicit superlinear rate has been given for QuasiNewton methods in the incremental setting. 
Zhan Gao 🔗 
Fri 12:00 p.m.  12:30 p.m.

Contributed Video: TenIPS: Inverse Propensity Sampling for Tensor Completion, Chengrun Yang
(Talk)
link »
SlidesLive Video » Tensors are widely used to model relationships among objects in a highdimensional space. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. The algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a randomized linear algebra approach, reweighting the observed tensor by the inverse propensities. We provide finitesample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach. 
Chengrun Yang 🔗 
Fri 12:30 p.m.  1:30 p.m.

Break (gather.town)
(Break)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
🔗 
Fri 1:30 p.m.  1:50 p.m.

Invited speaker: Fast convergence of stochastic subgradient method under interpolation, Michael Friedlander
(Talk)
SlidesLive Video » This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to overparameterized empiricalrisk optimization models that exactly fit the training data. We prove that for models with composite structures often found in neural networks, the interpolation condition implies that the model is effectively smooth at all minimizers, and therefore that SSGD converges at rates normally achievable only for smooth convex problems. We also prove that the fast rates we derive are optimal for any subgradient method applied to convex problems where interpolation holds. This is joint work with Huang Fang and Zhenan Fan. 
Michael Friedlander 🔗 
Fri 1:30 p.m.  1:35 p.m.

Intro to Invited Speaker 8
(Organizer intro)

Mark Schmidt 🔗 
Fri 1:50 p.m.  2:00 p.m.

Live Q&A with Michael Friedlander (Zoom)
(Q&A)

Mark Schmidt 🔗 
Fri 2:00 p.m.  2:50 p.m.

Poster Session 3 (gather.town)
(Poster session)
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Denny Wu · Chengrun Yang · Tolga Ergen · sanae lotfi · Charles GuilleEscuret · Boris Ginsburg · Hanbake Lyu · Cong Xie · David Newton · Debraj Basu · Yewen Wang · James Lucas · MAOJIA LI · Lijun Ding · Jose Javier Gonzalez Ortiz · Reyhane Askari Hemmat · Zhiqi Bu · Neal Lawton · Kiran Thekumparampil · Jiaming Liang · Lindon Roberts · Jingyi Zhu · Dongruo Zhou

Fri 2:50 p.m.  3:00 p.m.

Welcome remarks to Session 4
(Opening remarks)

Quanquan Gu 🔗 
Fri 3:00 p.m.  3:20 p.m.

Invited speaker: Online nonnegative matrix factorization for Markovian and other real data, Deanna Needell and Hanbaek Lyu
(Talk)
SlidesLive Video » Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a nonconvex generalization of the wellknown OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts `network dictionary patches' from a given network in an online manner that encodes main features of the network. We demonstrate this technique on realworld data and discuss recent extensions and variations. 
Hanbake Lyu · Deanna Needell 🔗 
Fri 3:20 p.m.  3:30 p.m.

Live Q&A with Deanna Needell and Hanbake Lyu (Zoom)
(Q&A)

Quanquan Gu 🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed talks in Session 4 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Tolga Ergen, "Convex Programs for Global Optimization of Convolutional Neural Networks in PolynomialTime" Charles GuilleEscuret, "A Study of Condition Numbers for FirstOrder Optimization" Lewis Liu, "AffineInvariant Analysis of FrankWolfe on Strongly Convex Sets" Sanae Lotfi, "Stochastic Damped LBFGS with controlled norm of the Hessian approximation" Dongruo Zhou, "On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization" You can find a video on the NeurIPS website where the speakers discuss in detail their paper. 
Quanquan Gu · sanae lotfi · Charles GuilleEscuret · Tolga Ergen · Dongruo Zhou 🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Stochastic Damped LBFGS with controlled norm of the Hessian approximation, Sanae Lotfi
(Talk)
link »
SlidesLive Video » We propose a new stochastic variancereduced damped LBFGS algorithm, where we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped LBFGS algorithm called SdLBFGS. We establish almost sure converges to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGSVR and SVRG on a modified DavidNet problem  a highly nonconvex and illconditioned problem that arises in the context of deep learning, and their performance is comparable on a logistic regression problem and a nonconvex supportvector machine problem. 
sanae lotfi 🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: A Study of Condition Numbers for FirstOrder Optimization, Charles GuilleEscuret
(Talk)
link »
SlidesLive Video » In this work we introduce a new framework for the theoretical study of convergence and tuning of firstorder optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy.Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. 
Charles GuilleEscuret 🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: AffineInvariant Analysis of FrankWolfe on Strongly Convex Sets, Lewis Liu
(Talk)
link »
SlidesLive Video »
When the constraint set $\mathcal{C}$ is strongly convex, the FrankWolfe algorithm, which is affine covariant, enjoys accelerated convergence rates. In contrast, existing results rely on normdependent assumptions, usually incurring nonaffine invariant bounds. In this work, we introduce new structural assumptions on the problem and derive an affine invariant, normindependent analysis of Frank Wolfe. Based on our analysis, we propose an affineinvariant backtracking linesearch. Interestingly, we show that typical backtracking linesearches using smoothness of the objective function surprisingly converge to an affineinvariant stepsize, despite using affinedependent norms in the computation of the stepsize.

🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Convex Programs for Global Optimization of Convolutional Neural Networks in PolynomialTime, Tolga Ergen
(Talk)
link »
SlidesLive Video »
We study training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with a polynomial complexity with respect to the number of data samples, the number of neurons and data dimension. Particularly, we develop a convex analytic framework utilizing semiinfinite duality to obtain equivalent convex optimization problems for twolayer CNNs, where convex problems are regularized by the sum of $\ell_2$ norms of variables.

Tolga Ergen 🔗 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, Dongruo Zhou
(Talk)
link »
SlidesLive Video » Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a finegrained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a firstorder stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than stochastic gradient decent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in nonconvex setting. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives. 
Dongruo Zhou 🔗 
Fri 4:00 p.m.  4:05 p.m.

Closing remarks

Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 🔗 
Author Information
Courtney Paquette (Lehigh University)
Mark Schmidt (University of British Columbia)
Sebastian Stich (EPFL)
Dr. [Sebastian U. Stich](https://sstich.ch/) is a faculty at the CISPA Helmholtz Center for Information Security. Research interests:  *methods for machine learning and statistics*—at the interface of theory and practice  *collaborative learning* (distributed, federated and decentralized methods)  *optimization for machine learning* (adaptive stochastic methods and generalization performance)
Quanquan Gu (UCLA)
Martin Takac (Lehigh University)
More from the Same Authors

2021 : Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization »
Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu 
2021 : Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization »
Difan Zou · Yuan Cao · Yuanzhi Li · Quanquan Gu 
2021 : Heavytailed noise does not explain the gap between SGD and Adam on Transformers »
Jacques Chen · Frederik Kunstner · Mark Schmidt 
2021 : Heavytailed noise does not explain the gap between SGD and Adam on Transformers »
Jacques Chen · Frederik Kunstner · Mark Schmidt 
2021 : Randomreshuffled SARAH does not need a full gradient computations »
Aleksandr Beznosikov · Martin Takac 
2021 : Escaping Local Minima With Stochastic Noise »
Harshvardhan Harshvardhan · Sebastian Stich 
2021 : Faster Perturbed Stochastic Gradient Methods for Finding Local Minima »
Zixiang Chen · Dongruo Zhou · Quanquan Gu 
2021 : Faster QuasiNewton Methods for Linear Composition Problems »
Betty Shea · Mark Schmidt 
2021 : A Closer Look at Gradient Estimators with Reinforcement Learning as Inference »
Jonathan Lavington · Michael Teng · Mark Schmidt · Frank Wood 
2021 : An Empirical Study of NonUniform Sampling in OffPolicy Reinforcement Learning for Continuous Control »
Nicholas Ioannidis · Jonathan Lavington · Mark Schmidt 
2021 : Learning TwoPlayer Mixture Markov Games: Kernel Function Approximation and Correlated Equilibrium »
Chris Junchi Li · Dongruo Zhou · Quanquan Gu · Michael Jordan 
2021 : The Peril of Popular Deep Learning Uncertainty Estimation Methods »
Yehao Liu · Matteo Pagliardini · Tatjana Chavdarova · Sebastian Stich 
2022 : Targetbased Surrogates for Stochastic Optimization »
Jonathan Lavington · Sharan Vaswani · Reza Babanezhad Harikandeh · Mark Schmidt · Nicolas Le Roux 
2022 : Fast Convergence of Greedy 2Coordinate Updates for Optimizing with an Equality Constraint »
Amrutha Varshini Ramesh · Aaron Mishkin · Mark Schmidt 
2022 : Fast Convergence of Random Reshuffling under Interpolation and the PolyakŁojasiewicz Condition »
Chen Fan · Christos Thrampoulidis · Mark Schmidt 
2022 : Effects of momentum scaling for SGD »
Dmitry A. Pasechnyuk · Alexander Gasnikov · Martin Takac 
2022 : Using quadratic equations for overparametrized models »
Shuang Li · William Swartworth · Martin Takac · Deanna Needell · Robert Gower 
2022 : FLECSCGD: A Federated Learning SecondOrder Framework via Compression and Sketching with Compressed Gradient Differences »
Artem Agafonov · Brahim Erraji · Martin Takac 
2022 : Cubic Regularized QuasiNewton Methods »
Dmitry Kamzolov · Klea Ziu · Artem Agafonov · Martin Takac 
2022 : PSPS: Preconditioned Stochastic Polyak Stepsize method for badly scaled data »
Farshed Abdukhakimov · Chulu Xiang · Dmitry Kamzolov · Robert Gower · Martin Takac 
2022 : Practical Structured Riemannian Optimization with Momentum by using Generalized Normal Coordinates »
Wu Lin · Valentin Duruisseaux · Melvin Leok · Frank Nielsen · Mohammad Emtiyaz Khan · Mark Schmidt 
2022 : A General Framework for SampleEfficient Function Approximation in Reinforcement Learning »
Zixiang Chen · Chris Junchi Li · Angela Yuan · Quanquan Gu · Michael Jordan 
2022 Spotlight: Risk Bounds of MultiPass SGD for Least Squares in the Interpolation Regime »
Difan Zou · Jingfeng Wu · Vladimir Braverman · Quanquan Gu · Sham Kakade 
2022 : Contributed Talks 2 »
Quanquan Gu · Aaron Defazio · Jiajin Li 
2022 Workshop: OPT 2022: Optimization for Machine Learning »
Courtney Paquette · Sebastian Stich · Quanquan Gu · Cristóbal Guzmán · John Duchi 
2022 Workshop: Order up! The Benefits of HigherOrder Optimization in Machine Learning »
Albert Berahas · Jelena Diakonikolas · Jarad Forristal · Brandon Reese · Martin Takac · Yan Xu 
2022 Poster: Towards Understanding the MixtureofExperts Layer in Deep Learning »
Zixiang Chen · Yihe Deng · Yue Wu · Quanquan Gu · Yuanzhi Li 
2022 Poster: Computationally Efficient HorizonFree Reinforcement Learning for Linear Mixture MDPs »
Dongruo Zhou · Quanquan Gu 
2022 Poster: Benign Overfitting in Twolayer Convolutional Neural Networks »
Yuan Cao · Zixiang Chen · Misha Belkin · Quanquan Gu 
2022 Poster: Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions »
Courtney Paquette · Elliot Paquette · Ben Adlam · Jeffrey Pennington 
2022 Poster: Learning TwoPlayer Markov Games: Neural Function Approximation and Correlated Equilibrium »
Chris Junchi Li · Dongruo Zhou · Quanquan Gu · Michael Jordan 
2022 Poster: A Simple and Provably Efficient Algorithm for Asynchronous Federated Contextual Linear Bandits »
Jiafan He · Tianhao Wang · Yifei Min · Quanquan Gu 
2022 Poster: The Power and Limitation of PretrainingFinetuning for Linear Regression under Covariate Shift »
Jingfeng Wu · Difan Zou · Vladimir Braverman · Quanquan Gu · Sham Kakade 
2022 Poster: Risk Bounds of MultiPass SGD for Least Squares in the Interpolation Regime »
Difan Zou · Jingfeng Wu · Vladimir Braverman · Quanquan Gu · Sham Kakade 
2022 Poster: A Damped Newton Method Achieves Global $\mathcal O \left(\frac{1}{k^2}\right)$ and Local Quadratic Convergence Rate »
Slavomír Hanzely · Dmitry Kamzolov · Dmitry Pasechnyuk · Alexander Gasnikov · Peter Richtarik · Martin Takac 
2022 Poster: Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions »
Jiafan He · Dongruo Zhou · Tong Zhang · Quanquan Gu 
2022 Poster: Trajectory of MiniBatch Momentum: Batch Size Saturation and Convergence in High Dimensions »
Kiwon Lee · Andrew Cheng · Elliot Paquette · Courtney Paquette 
2022 Poster: Active Ranking without Strong Stochastic Transitivity »
Hao Lou · Tao Jin · Yue Wu · Pan Xu · Quanquan Gu · Farzad Farnoud 
2021 : Contributed talks in Session 4 (Zoom) »
Quanquan Gu · Agnieszka Słowik · Jacques Chen · Neha Wadia · Difan Zou 
2021 : Opening Remarks to Session 4 »
Quanquan Gu 
2021 : Contributed Talks in Session 1 (Zoom) »
Sebastian Stich · Futong Liu · Abdurakhmon Sadiev · Frederik Benzing · Simon Roburin 
2021 : Opening Remarks to Session 1 »
Sebastian Stich 
2021 Workshop: OPT 2021: Optimization for Machine Learning »
Courtney Paquette · Quanquan Gu · Oliver Hinder · Katya Scheinberg · Sebastian Stich · Martin Takac 
2021 Poster: The Benefits of Implicit Regularization from SGD in Least Squares Problems »
Difan Zou · Jingfeng Wu · Vladimir Braverman · Quanquan Gu · Dean Foster · Sham Kakade 
2021 Poster: UniformPAC Bounds for Reinforcement Learning with Linear Function Approximation »
Jiafan He · Dongruo Zhou · Quanquan Gu 
2021 Poster: Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent »
Spencer Frei · Quanquan Gu 
2021 Poster: Breaking the centralized barrier for crossdevice federated learning »
Sai Praneeth Karimireddy · Martin Jaggi · Satyen Kale · Mehryar Mohri · Sashank Reddi · Sebastian Stich · Ananda Theertha Suresh 
2021 Poster: Risk Bounds for Overparameterized Maximum Margin Classification on SubGaussian Mixtures »
Yuan Cao · Quanquan Gu · Mikhail Belkin 
2021 Poster: Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs »
Jiafan He · Dongruo Zhou · Quanquan Gu 
2021 Poster: RewardFree ModelBased Reinforcement Learning with Linear Function Approximation »
Weitong ZHANG · Dongruo Zhou · Quanquan Gu 
2021 Poster: VarianceAware OffPolicy Evaluation with Linear Function Approximation »
Yifei Min · Tianhao Wang · Dongruo Zhou · Quanquan Gu 
2021 Poster: RelaySum for Decentralized Deep Learning on Heterogeneous Data »
Thijs Vogels · Lie He · Anastasiia Koloskova · Sai Praneeth Karimireddy · Tao Lin · Sebastian Stich · Martin Jaggi 
2021 Poster: Iterative TeacherAware Learning »
Luyao Yuan · Dongruo Zhou · Junhong Shen · Jingdong Gao · Jeffrey L Chen · Quanquan Gu · Ying Nian Wu · SongChun Zhu 
2021 Poster: An Improved Analysis of Gradient Tracking for Decentralized Machine Learning »
Anastasiia Koloskova · Tao Lin · Sebastian Stich 
2021 Poster: Provably Efficient Reinforcement Learning with Linear Function Approximation under Adaptivity Constraints »
Tianhao Wang · Dongruo Zhou · Quanquan Gu 
2021 Poster: Exploring Architectural Ingredients of Adversarially Robust Deep Neural Networks »
Hanxun Huang · Yisen Wang · Sarah Erfani · Quanquan Gu · James Bailey · Xingjun Ma 
2021 Poster: Do Wider Neural Networks Really Help Adversarial Robustness? »
Boxi Wu · Jinghui Chen · Deng Cai · Xiaofei He · Quanquan Gu 
2021 Poster: Pure Exploration in Kernel and Neural Bandits »
Yinglun Zhu · Dongruo Zhou · Ruoxi Jiang · Quanquan Gu · Rebecca Willett · Robert Nowak 
2021 Poster: Dynamics of Stochastic Momentum Methods on Largescale, Quadratic Models »
Courtney Paquette · Elliot Paquette 
2020 : Closing remarks »
Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 
2020 : Contributed talks in Session 4 (Zoom) »
Quanquan Gu · sanae lotfi · Charles GuilleEscuret · Tolga Ergen · Dongruo Zhou 
2020 : Live Q&A with Deanna Needell and Hanbake Lyu (Zoom) »
Quanquan Gu 
2020 : Welcome remarks to Session 4 »
Quanquan Gu 
2020 : Live Q&A with Michael Friedlander (Zoom) »
Mark Schmidt 
2020 : Intro to Invited Speaker 8 »
Mark Schmidt 
2020 : Contributed talks in Session 3 (Zoom) »
Mark Schmidt · Zhan Gao · Wenjie Li · Preetum Nakkiran · Denny Wu · Chengrun Yang 
2020 : Live Q&A with Rachel Ward (Zoom) »
Mark Schmidt 
2020 : Live Q&A with Ashia Wilson (Zoom) »
Mark Schmidt 
2020 : Welcome remarks to Session 3 »
Mark Schmidt 
2020 : Live Q&A with Suvrit Sra (Zoom) »
Martin Takac 
2020 : Intro to Invited Speaker 5 »
Martin Takac 
2020 : Contributed talks in Session 2 (Zoom) »
Martin Takac · Samuel Horváth · GuanHorng Liu · Nicolas Loizou · Sharan Vaswani 
2020 : Live Q&A with Donald Goldfarb (Zoom) »
Martin Takac 
2020 : Live Q&A with Andreas Krause (Zoom) »
Martin Takac 
2020 : Welcome remarks to Session 2 »
Martin Takac 
2020 : Contributed talks in Session 1 (Zoom) »
Sebastian Stich · Laurent Condat · Zhize Li · Ohad Shamir · Tiffany Vlaar · Mohammadi Zaki 
2020 : Live Q&A with Volkan Cevher (Zoom) »
Sebastian Stich 
2020 : Live Q&A with Tong Zhang (Zoom) »
Sebastian Stich 
2020 : Welcome remarks to Session 1 »
Sebastian Stich 
2020 : Welcome event (gather.town) »
Quanquan Gu · Courtney Paquette · Mark Schmidt · Sebastian Stich · Martin Takac 
2020 Poster: A Generalized Neural Tangent Kernel Analysis for Twolayer Neural Networks »
Zixiang Chen · Yuan Cao · Quanquan Gu · Tong Zhang 
2020 Poster: Regret Bounds without Lipschitz Continuity: Online Learning with RelativeLipschitz Losses »
Yihan Zhou · Victor Sanches Portella · Mark Schmidt · Nicholas Harvey 
2020 Poster: Ensemble Distillation for Robust Model Fusion in Federated Learning »
Tao Lin · Lingjing Kong · Sebastian Stich · Martin Jaggi 
2020 Poster: Agnostic Learning of a Single Neuron with Gradient Descent »
Spencer Frei · Yuan Cao · Quanquan Gu 
2020 Poster: A FiniteTime Analysis of Two TimeScale ActorCritic Methods »
Yue Wu · Weitong ZHANG · Pan Xu · Quanquan Gu 
2019 : Poster and Coffee Break 2 »
Karol Hausman · Kefan Dong · Ken Goldberg · Lihong Li · Lin Yang · Lingxiao Wang · Lior Shani · Liwei Wang · Loren AmdahlCulleton · Lucas Cassano · Marc Dymetman · Marc Bellemare · Marcin Tomczak · Margarita Castro · Marius Kloft · MariusConstantin Dinu · Markus Holzleitner · Martha White · Mengdi Wang · Michael Jordan · Mihailo Jovanovic · Ming Yu · Minshuo Chen · Moonkyung Ryu · Muhammad Zaheer · Naman Agarwal · Nan Jiang · Niao He · Nikolaus Yasui · Nikos Karampatziakis · Nino Vieillard · Ofir Nachum · Olivier Pietquin · Ozan Sener · Pan Xu · Parameswaran Kamalaruban · Paul Mineiro · Paul Rolland · Philip Amortila · PierreLuc Bacon · Prakash Panangaden · Qi Cai · Qiang Liu · Quanquan Gu · Raihan Seraj · Richard Sutton · Rick Valenzano · Robert Dadashi · Rodrigo Toro Icarte · Roshan Shariff · Roy Fox · Ruosong Wang · Saeed Ghadimi · Samuel Sokota · Sean Sinclair · Sepp Hochreiter · Sergey Levine · Sergio Valcarcel Macua · Sham Kakade · Shangtong Zhang · Sheila McIlraith · Shie Mannor · Shimon Whiteson · Shuai Li · Shuang Qiu · Wai Lok Li · Siddhartha Banerjee · Sitao Luan · Tamer Basar · Thinh Doan · Tianhe Yu · Tianyi Liu · Tom Zahavy · Toryn Klassen · Tuo Zhao · Vicenç Gómez · Vincent Liu · Volkan Cevher · Wesley Suttle · XiaoWen Chang · Xiaohan Wei · Xiaotong Liu · Xingguo Li · Xinyi Chen · Xingyou Song · Yao Liu · YiDing Jiang · Yihao Feng · Yilun Du · Yinlam Chow · Yinyu Ye · Yishay Mansour · · Yonathan Efroni · Yongxin Chen · Yuanhao Wang · Bo Dai · ChenYu Wei · Harsh Shrivastava · Hongyang Zhang · Qinqing Zheng · SIDDHARTHA SATPATHI · Xueqing Liu · Andreu Vall 
2019 : Poster Session »
Eduard Gorbunov · Alexandre d'Aspremont · Lingxiao Wang · Liwei Wang · Boris Ginsburg · Alessio Quaglino · Camille Castera · Saurabh Adya · Diego Granziol · Rudrajit Das · Raghu Bollapragada · Fabian Pedregosa · Martin Takac · Majid Jahani · Sai Praneeth Karimireddy · Hilal Asi · Balint Daroczy · Leonard Adolphs · Aditya Rawal · Nicolas Brandt · Minhan Li · Giuseppe Ughi · Orlando Romero · Ivan Skorokhodov · Damien Scieur · Kiwook Bae · Konstantin Mishchenko · Rohan Anil · Vatsal Sharan · Aditya Balu · Chao Chen · Zhewei Yao · Tolga Ergen · Paul Grigas · Chris Junchi Li · Jimmy Ba · Stephen J Roberts · Sharan Vaswani · Armin Eftekhari · Chhavi Sharma 
2019 Poster: AlgorithmDependent Generalization Bounds for Overparameterized Deep Residual Networks »
Spencer Frei · Yuan Cao · Quanquan Gu 
2019 Poster: LayerDependent Importance Sampling for Training Deep and Large Graph Convolutional Networks »
Difan Zou · Ziniu Hu · Yewen Wang · Song Jiang · Yizhou Sun · Quanquan Gu 
2019 Poster: Painless Stochastic Gradient: Interpolation, LineSearch, and Convergence Rates »
Sharan Vaswani · Aaron Mishkin · Issam Laradji · Mark Schmidt · Gauthier Gidel · Simon LacosteJulien 
2019 Poster: Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction »
Difan Zou · Pan Xu · Quanquan Gu 
2019 Poster: Tight Sample Complexity of Learning Onehiddenlayer Convolutional Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Poster: An Improved Analysis of Training Overparameterized Deep Neural Networks »
Difan Zou · Quanquan Gu 
2019 Poster: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Spotlight: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2018 Poster: Thirdorder Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima »
Yaodong Yu · Pan Xu · Quanquan Gu 
2018 Poster: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient »
Aaron Mishkin · Frederik Kunstner · Didrik Nielsen · Mark Schmidt · Mohammad Emtiyaz Khan 
2018 Spotlight: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster SecondOrder Optimization »
Robert Gower · Filip Hanzely · Peter Richtarik · Sebastian Stich 
2018 Poster: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Reinforcement Learning for Solving the Vehicle Routing Problem »
MohammadReza Nazari · Afshin Oroojlooy · Lawrence Snyder · Martin Takac 
2018 Spotlight: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Sparsified SGD with Memory »
Sebastian Stich · JeanBaptiste Cordonnier · Martin Jaggi 
2018 Poster: Distributed Learning without Distress: PrivacyPreserving Empirical Risk Minimization »
Bargav Jayaraman · Lingxiao Wang · David Evans · Quanquan Gu 
2017 Poster: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Spotlight: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Poster: Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization »
Pan Xu · Jian Ma · Quanquan Gu 
2016 : Fast Patchbased Style Transfer of Arbitrary Style »
Tian Qi Chen · Mark Schmidt 
2016 Poster: Semiparametric Differential Graph Models »
Pan Xu · Quanquan Gu 
2016 Poster: A MultiBatch LBFGS Method for Machine Learning »
Albert Berahas · Jorge Nocedal · Martin Takac 
2015 Poster: High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality »
Zhaoran Wang · Quanquan Gu · Yang Ning · Han Liu 
2015 Poster: StopWasting My Gradients: Practical SVRG »
Reza Babanezhad Harikandeh · Mohamed Osama Ahmed · Alim Virani · Mark Schmidt · Jakub Konečný · Scott Sallinen 
2014 Poster: CommunicationEfficient Distributed Dual Coordinate Ascent »
Martin Jaggi · Virginia Smith · Martin Takac · Jonathan Terhorst · Sanjay Krishnan · Thomas Hofmann · Michael Jordan 
2014 Poster: Sparse PCA with Oracle Property »
Quanquan Gu · Zhaoran Wang · Han Liu 
2014 Poster: Robust Tensor Decomposition with Gross Corruption »
Quanquan Gu · Huan Gui · Jiawei Han 
2012 Poster: Selective Labeling via Error Bound Minimization »
Quanquan Gu · Tong Zhang · Chris Ding · Jiawei Han