Optimization lies at the heart of many machine learning algorithms and enjoys great interest in our community. Indeed, this intimate relation of optimization with ML is the key motivation for the OPT series of workshops.
Looking back over the past decade, a strong trend is apparent: the intersection of OPT and ML has grown to the point that cutting-edge advances in optimization now often arise from the ML community. The distinctive feature of optimization within ML is its departure from textbook approaches, in particular its focus on a different set of goals driven by "big data, nonconvexity, and high dimensions," where both theory and implementation are crucial.
We wish to use OPT 2020 as a platform to foster discussion, discovery, and dissemination of the state of the art in optimization as relevant to machine learning. And well beyond that: as a platform to identify new directions and challenges that will drive future research, and to continue building the joint OPT+ML research community.
Invited Speakers
Volkan Cevher (EPFL)
Michael Friedlander (UBC)
Donald Goldfarb (Columbia)
Andreas Krause (ETH, Zurich)
Suvrit Sra (MIT)
Rachel Ward (UT Austin)
Ashia Wilson (MSR)
Tong Zhang (HKUST)
Instructions
Please join us in gather.town for all breaks and poster sessions (Click "Open Link" on any break or poster session).
To see all submitted papers and posters, go to the "optml website" at the top of the page.
Use RocketChat or the Zoom link (top of page) to ask a speaker a direct question during the Live Q&A and Contributed Talks.
Fri 3:15 a.m. – 3:50 a.m.

Welcome event (gather.town)
(Social event/Break)
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town.
Quanquan Gu, Courtney Paquette, Mark Schmidt, Sebastian Stich, Martin Takac 
Fri 3:50 a.m. – 4:00 a.m.

Welcome remarks to Session 1
(Opening remarks)

Sebastian Stich 
Fri 4:00 a.m. – 4:20 a.m.

Invited speaker: The Convexity of Learning Infinite-width Deep Neural Networks, Tong Zhang
(Talk)
Deep learning has achieved considerable empirical success in recent years. Although deep neural networks (DNNs) are highly nonconvex with respect to the model parameters, it has been observed that training overparametrized DNNs leads to consistent solutions that are highly reproducible across different random initializations. I will explain this phenomenon by modeling DNNs using feature representations, and show that the optimization landscape is convex with respect to the features. Moreover, we show that optimization with respect to the nonconvex DNN parameters leads to a globally optimal solution under an idealized regularity condition, which can explain various empirical findings.
Tong Zhang 
Fri 4:20 a.m. – 4:30 a.m.

Live Q&A with Tong Zhang (Zoom)
(Q&A)

Sebastian Stich 
Fri 4:30 a.m. – 4:50 a.m.

Invited speaker: Adaptation and universality in first-order methods, Volkan Cevher
(Talk)
In this talk, we review some of the recent advances in first-order methods for convex and nonconvex optimization, as well as their universality properties. We say an algorithm is universal if it does not need to know whether the optimization objective is smooth. We first recall the AdaGrad method and show that AdaGrad is a universal algorithm without any modifications: it implicitly exploits the smoothness of the problem to achieve the standard O(1/k) rate for smooth objectives, where k is the iteration count. We then introduce an accelerated, universal variant of AdaGrad, dubbed AcceleGrad, which in addition obtains the optimal convergence rate of O(1/k^2) in the smooth setting with deterministic oracles. We also introduce UniXGrad, the first algorithm that simultaneously achieves optimal rates for smooth or nonsmooth problems with either deterministic or stochastic first-order oracles in the constrained convex setting. We conclude the presentation with results in nonconvex optimization revolving around ADAM-type algorithms, including new convergence rates.
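As context for the universality discussion, the diagonal AdaGrad update scales each coordinate by its accumulated squared gradients, with no smoothness parameter to set. A minimal sketch, not the speaker's code; the quadratic test objective and hyperparameters are made-up illustrations:

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, steps=300, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate steps shrink with the
    running sum of squared gradients seen so far."""
    x = np.asarray(x0, dtype=float)
    acc = np.zeros_like(x)  # accumulated squared gradients
    for _ in range(steps):
        g = grad(x)
        acc += g * g
        x -= lr * g / (np.sqrt(acc) + eps)
    return x

# Toy smooth objective f(x) = x1^2 + 10*x2^2, minimized at the origin.
x = adagrad(lambda x: np.array([2.0 * x[0], 20.0 * x[1]]), [1.0, 1.0])
```

The same code runs unchanged on nonsmooth objectives (with subgradients in place of gradients), which is the sense in which the method is universal.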
Volkan Cevher 
Fri 5:00 a.m. – 5:30 a.m.

Contributed talks in Session 1 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Laurent Condat, "Distributed Proximal Splitting Algorithms with Rates and Acceleration"
Zhize Li, "PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization"
Ohad Shamir, "Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?"
Tiffany Vlaar, "Constraint-Based Regularization of Neural Networks"
Mohammadi Zaki, "Employing No Regret Learners for Pure Exploration in Linear Bandits"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Sebastian Stich, Laurent Condat, Zhize Li, Ohad Shamir, Tiffany Vlaar, Mohammadi Zaki 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Constraint-Based Regularization of Neural Networks, Tiffany Vlaar
(Talk)
We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes, and stabilize deep neural networks, thus improving the robustness of training algorithms and the generalization capabilities of the trained network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in both the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. Our methods see performance improvements on image classification tasks.
Tiffany Vlaar 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Can We Find Near-Approximately-Stationary Points of Nonsmooth Nonconvex Functions?, Ohad Shamir
(Talk)
It is well-known that given a bounded, smooth nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (where the gradient norm is less than $\epsilon$) in $O(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently \emph{not} smooth, making these results inapplicable. Moreover, as recently pointed out in \citet{zhang2020complexity}, it is generally impossible to provide finite-time guarantees for finding an $\epsilon$-stationary point of nonsmooth functions. Perhaps the most natural relaxation is to find points which are *near* such $\epsilon$-stationary points. In this paper, we show that even this relaxed goal is hard to attain in general, given only black-box access to the function values and gradients. We also discuss the pros and cons of alternative approaches.

Ohad Shamir 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Employing No Regret Learners for Pure Exploration in Linear Bandits, Mohammadi Zaki
(Talk)
We study the best arm identification problem in linear multi-armed bandits (LMAB) in the fixed-confidence ($\delta$-PAC) setting; this is also the problem of optimizing an unknown linear function over a discrete ground set with noisy, zeroth-order access. We propose an explicitly implementable and provably order-optimal sample-complexity algorithm to solve this problem. Most previous approaches rely on access to a minimax optimization oracle, which is at the heart of the complexity of the problem. We propose a method to solve this optimization problem (up to suitable accuracy) by interpreting it as a two-player zero-sum game and sequentially converging to its saddle point, using low-regret learners to compute the players' strategies in each round; this yields a concrete querying algorithm. The algorithm, which we call the {\em Phased Elimination Linear Exploration Game} (PELEG), maintains a high-probability confidence ellipsoid containing $\theta^*$ in each round and uses it to eliminate suboptimal arms in phases. We analyze the sample complexity of PELEG and show that it matches, up to order, an instance-dependent lower bound on sample complexity in the linear bandit setting, without requiring boundedness assumptions on the parameter space. PELEG is thus the first algorithm to achieve both order-optimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.

Mohammadi Zaki 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: Distributed Proximal Splitting Algorithms with Rates and Acceleration, Laurent Condat
(Talk)
We propose new generic distributed proximal splitting algorithms, well suited for large-scale convex nonsmooth optimization. We derive sublinear and linear convergence results with new non-ergodic rates, as well as new accelerated versions of the algorithms, using varying stepsizes.
Laurent Condat 
Fri 5:00 a.m. – 5:30 a.m.

Contributed Video: PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization, Zhize Li
(Talk)
In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement, as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p$, and with probability $1-p$ reuses the previous gradient with a small adjustment, at a much lower computational cost. We give a simple formula for the optimal choice of $p$. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds in both the finite-sum and online regimes, which establish PAGE as an optimal method. Besides, we show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch; the results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.
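The update rule described in the abstract can be sketched on a least-squares toy problem as follows. This is a minimal illustration under assumed hyperparameters (step size, batch size, $p$), not the authors' implementation:

```python
import numpy as np

def page_minimize(A, b, steps=300, lr=0.1, p=0.5, batch=8, seed=0):
    """Minimize 0.5*mean((A x - b)^2) with a PAGE-style estimator.

    Each iteration: with probability p take a fresh minibatch gradient
    (the vanilla SGD update); with probability 1-p reuse the previous
    estimate plus a cheap minibatch correction g + grad(x_new) - grad(x_old)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape

    def mb_grad(x, idx):  # minibatch gradient of the least-squares loss
        return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

    x = np.zeros(d)
    g = A.T @ (A @ x - b) / n  # initialize with one full gradient
    for _ in range(steps):
        x_new = x - lr * g
        idx = rng.choice(n, size=batch, replace=False)
        if rng.random() < p:
            g = mb_grad(x_new, idx)                        # vanilla SGD branch
        else:
            g = g + mb_grad(x_new, idx) - mb_grad(x, idx)  # reuse + correction
        x = x_new
    return x
```

The reuse branch evaluates only one small minibatch at two points, which is where the lower per-iteration cost relative to a fresh large-batch gradient comes from.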

Zhize Li 
Fri 6:00 a.m. – 6:50 a.m.

Poster Session 1 (gather.town)
(Poster session)
Laurent Condat, Tiffany Vlaar, Ohad Shamir, Mohammadi Zaki, Zhize Li, GuanHorng Liu, Samuel Horváth, Mher Safaryan, Yoni Choukroun, kumar Shridhar, Nabil Kahale, Jikai Jin, Pratik Kumar Jawanpuria, Gaurav Kumar Yadav, Kazuki Koyama, Junyoung Kim, Xiao Li, Saugata Purkayastha, Adil Salim, Dighanchal Banerjee, Peter Richtarik, Lakshman Mahto, Tian Ye, Bamdev Mishra, Huikang Liu, JiaJie Zhu

Fri 6:50 a.m. – 7:00 a.m.

Welcome remarks to Session 2
(Opening remarks)

Martin Takac 
Fri 7:00 a.m. – 7:20 a.m.

Invited speaker: Adaptive Sampling for Stochastic Risk-Averse Learning, Andreas Krause
(Talk)
In high-stakes machine learning applications, it is crucial to perform well not only on average, but also when restricted to difficult examples. To address this, we consider the problem of training models in a risk-averse manner. We propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players, and solve it efficiently using regret minimization. Our approach relies on sampling from structured Determinantal Point Processes (DPPs), which allows it to scale to large data sets. Finally, we empirically demonstrate its effectiveness on large-scale convex and nonconvex learning tasks.
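For reference, the empirical CVaR objective being optimized is simply the average of the worst $\alpha$-fraction of losses. A minimal sketch of the quantity itself (not the speaker's algorithm):

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of losses."""
    sorted_desc = np.sort(np.asarray(losses, dtype=float))[::-1]
    k = max(1, int(np.ceil(alpha * len(sorted_desc))))
    return sorted_desc[:k].mean()

# The worst half of the losses [1, 2, 3, 4] is [4, 3], so CVaR_0.5 = 3.5;
# at alpha = 1 the CVaR reduces to the ordinary mean.
```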
Andreas Krause 
Fri 7:20 a.m. – 7:30 a.m.

Live Q&A with Andreas Krause (Zoom)
(Q&A)

Martin Takac 
Fri 7:30 a.m. – 7:50 a.m.

Invited speaker: Practical Kronecker-factored BFGS and L-BFGS methods for training deep neural networks, Donald Goldfarb
(Talk)
In training deep neural network (DNN) models, computing and storing a full BFGS approximation, or storing a modest number of (step, change-in-gradient) vector pairs for use in an L-BFGS implementation, is impractical. In our methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices, analogous to the approach in KFAC for approximating the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded, both above and below. In tests on autoencoder feedforward and convolutional neural network models, our methods outperformed KFAC and were competitive with state-of-the-art first-order stochastic methods.
Donald Goldfarb 
Fri 8:00 a.m. – 8:30 a.m.

Contributed talks in Session 2 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Samuel Horvath, "Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization"
GuanHorng Liu, "DDPNOpt: Differential Dynamic Programming Neural Optimizer"
Nicolas Loizou, "Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence"
Sharan Vaswani, "Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)"
Sharan Vaswani, "How to make your optimizer generalize better"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Martin Takac, Samuel Horváth, GuanHorng Liu, Nicolas Loizou, Sharan Vaswani 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Adaptivity of Stochastic Gradient Methods for Nonconvex Optimization, Samuel Horvath
(Talk)
Adaptivity is an important yet understudied property in modern optimization theory. The gap between the state-of-the-art theory and current practice is striking in that algorithms with desirable theoretical guarantees typically involve drastically different settings of hyperparameters, such as stepsize schemes and batch sizes, in different regimes. Despite the appealing theoretical results, such divisive strategies provide little, if any, insight to practitioners for selecting algorithms that work broadly without tweaking the hyperparameters. In this work, blending the ``geometrization'' technique introduced by \cite{lei2016less} and the \texttt{SARAH} algorithm of \cite{nguyen2017sarah}, we propose the Geometrized \texttt{SARAH} algorithm for nonconvex finite-sum and stochastic optimization. Our algorithm is proved to achieve adaptivity to both the magnitude of the target accuracy and the Polyak-Łojasiewicz (PL) constant, if present. In addition, it achieves the best-available convergence rate for non-PL objectives while simultaneously outperforming existing algorithms for PL objectives.
Samuel Horváth 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Stochastic Polyak Stepsize for SGD: An Adaptive Learning Rate for Fast Convergence, Nicolas Loizou
(Talk)
We propose a stochastic variant of the classical Polyak stepsize \citep{polyak1987introduction} commonly used in the subgradient method. Although computing the Polyak stepsize requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak stepsize (SPS) is an attractive choice for setting the learning rate of stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex, and nonconvex functions. Furthermore, our analysis yields novel convergence guarantees for SGD with a constant stepsize. We show that SPS is particularly effective when training overparameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets, and demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training overparameterized models.
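In the interpolation setting the abstract highlights, each per-sample optimal value is zero and the SPS rule takes a particularly simple form. A minimal sketch on a consistent least-squares system (illustrative hyperparameters, not the authors' code):

```python
import numpy as np

def sgd_sps(A, b, steps=400, c=0.5, gamma_max=10.0, seed=0):
    """SGD with the stochastic Polyak stepsize on per-sample losses
    f_i(x) = 0.5*(a_i @ x - b_i)^2, assuming interpolation (f_i* = 0):
    stepsize = min(f_i(x) / (c * ||grad f_i(x)||^2), gamma_max)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)
        r = A[i] @ x - b[i]   # residual of sample i
        g = r * A[i]          # gradient of f_i at x
        gnorm2 = g @ g
        if gnorm2 > 0.0:
            gamma = min(0.5 * r * r / (c * gnorm2), gamma_max)
            x -= gamma * g
    return x
```

Note that no problem-dependent constant (e.g., a smoothness bound) appears anywhere: the stepsize is computed entirely from the sampled loss and gradient.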
Nicolas Loizou 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: DDPNOpt: Differential Dynamic Programming Neural Optimizer, GuanHorng Liu
(Talk)
The interpretation of Deep Neural Network (DNN) training as an optimal control problem with nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by first showing that most widely used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order method rooted in trajectory optimization. In this vein, we propose a new class of optimizers, the DDP Neural Optimizer (DDPNOpt), for training DNNs. DDPNOpt features layer-wise feedback policies which improve convergence and robustness. It outperforms other optimal-control-inspired training methods in both convergence and complexity, and is competitive against state-of-the-art first- and second-order methods. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.
GuanHorng Liu 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search), Sharan Vaswani
(Talk)
Adaptive gradient methods are typically used for training overparameterized models capable of exactly fitting the data; we thus study their convergence in this interpolation setting. Under an interpolation assumption, we prove that AMSGrad with a constant stepsize and momentum can converge to the minimizer at the faster $O(1/T)$ rate for smooth, convex functions. Furthermore, in this setting, we show that AdaGrad can achieve $O(1)$ regret in the online convex optimization framework. When interpolation is only approximately satisfied, we show that constant-stepsize AMSGrad converges to a neighbourhood of the solution. On the other hand, we prove that AdaGrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the stepsize and requires tuning. We alleviate this problem by using stochastic line-search (SLS) and Polyak's stepsizes (SPS) to help these methods adapt to the function's local smoothness. Using these techniques, we prove that AdaGrad and AMSGrad do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant-stepsize counterparts. Experimentally, we show that these techniques help improve the convergence and generalization performance across tasks, from binary classification with kernel mappings to classification with deep neural networks.

Sharan Vaswani 
Fri 8:00 a.m. – 8:30 a.m.

Contributed Video: How to make your optimizer generalize better, Sharan Vaswani
(Talk)
We study the implicit regularization of optimization methods for linear models interpolating the training data in the underparametrized and overparametrized regimes. For overparameterized linear regression, where there are infinitely many interpolating solutions, different optimization methods can converge to solutions with varying generalization performance. In this setting, we show that projections onto linear spans can be used to move between solutions. Furthermore, via a simple reparameterization, we can ensure that an arbitrary optimizer converges to the minimum $\ell_2$-norm solution with favourable generalization properties. For underparameterized linear classification, optimizers can converge to different decision boundaries separating the data. We prove that for any such classifier, there exists a family of quadratic norms $\|\cdot\|_P$ such that the classifier's direction is the same as that of the maximum $P$-margin solution. We argue that analyzing convergence to the standard maximum $\ell_2$-margin is arbitrary, and show that minimizing the norm induced by the data can result in better generalization. We validate our theoretical results via experiments on synthetic and real datasets.
Sharan Vaswani 
Fri 8:30 a.m. – 9:00 a.m.

Break (gather.town)
(Break)

Fri 9:00 a.m. – 9:20 a.m.

Invited speaker: SGD without replacement: optimal rate analysis and more, Suvrit Sra
(Talk)
Stochastic gradient descent (SGD) is the workhorse of machine learning. There are two fundamental versions of SGD: (i) those that pick stochastic gradients with replacement, and (ii) those that pick without replacement. Ironically, version (ii) is what is used in practice (across most ML toolkits), while version (i) is what almost all published work analyzes. This mismatch is well-known. It arises because analyzing SGD without replacement involves biased gradients and must cope with the lack of independence between the stochastic gradients used. In this talk, I will present recent progress on analyzing without-replacement SGD, the bulk of which will focus on minimax optimal convergence rates. The rates are obtained without assuming componentwise convexity. I will also mention further refinements of the results under this additional convexity assumption, which remove drawbacks common to previous works (such as the large number of epochs required).
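The two sampling schemes contrasted in the talk can be illustrated side by side. A minimal sketch, not the speaker's analysis; the least-squares test problem in the usage below is a made-up example:

```python
import numpy as np

def sgd_epoch_based(grad_i, n, x0, lr=0.05, epochs=30, replacement=False, seed=0):
    """Epoch-based SGD over n component functions.

    replacement=False: random reshuffling, i.e. each epoch visits every
    index exactly once in a fresh random order (what most ML toolkits do).
    replacement=True: indices drawn i.i.d., the variant usually analyzed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        order = rng.integers(n, size=n) if replacement else rng.permutation(n)
        for i in order:
            x -= lr * grad_i(i, x)
    return x
```

Conditioned on the earlier draws in an epoch, the without-replacement indices are not independent and the per-step gradient is biased for the full gradient, which is exactly what complicates the analysis.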
Suvrit Sra 
Fri 9:20 a.m. – 9:30 a.m.

Live Q&A with Suvrit Sra (Zoom)
(Q&A)

Martin Takac 
Fri 9:45 a.m. – 10:50 a.m.

Poster Session 2 (gather.town)
(Poster session)
Sharan Vaswani, Nicolas Loizou, Wenjie Li, Preetum Nakkiran, Zhan Gao, Sina Baghal, Jingfeng Wu, Roozbeh Yousefzadeh, Jinyi Wang, Jing Wang, Cong Xie, Anastasia Borovykh, Stanislaw Jastrzebski, Soham Dan, Yiliang Zhang, Mark Tuddenham, Sarath Pattathil, Ievgen Redko, Jeremy Cohen, Yasaman Esfandiari, Zhanhong Jiang, Mostafa ElAraby, Chulhee Yun, Michael Psenka, Robert Gower, Xiaoyu Wang

Fri 10:50 a.m. – 11:00 a.m.

Welcome remarks to Session 3
(Opening remarks)

Mark Schmidt 
Fri 11:00 a.m. – 11:20 a.m.

Invited speaker: Stochastic Geodesic Optimization, Ashia Wilson
(Talk)
Geodesic convexity offers a promising systematic way to handle nonconvexity for many problems of interest in statistics and computer science. The focus of this talk will be to describe efforts to extend the basic tools of convex optimization on Euclidean space to the general setting of Riemannian manifolds. We begin by motivating our focus on geodesic optimization with several examples, reviewing the basics of geodesic spaces and several techniques from optimization along the way. Particular attention will be given to optimization techniques which achieve oracle lower bounds for minimizing stochastic functions, namely accelerated methods. We end with a discussion of how one might adapt these techniques to the Riemannian setting. 
Ashia Wilson 
Fri 11:20 a.m. – 11:30 a.m.

Live Q&A with Ashia Wilson (Zoom)
(Q&A)

Mark Schmidt 
Fri 11:30 a.m. – 11:50 a.m.

Invited speaker: Concentration for matrix products, and convergence of Oja’s algorithm for streaming PCA, Rachel Ward
(Talk)
We present new non-asymptotic growth and concentration bounds for a product of independent random matrices, similar in spirit to the concentration bounds for sums of independent random matrices developed in the previous decade. Our matrix product concentration bounds provide a new, direct convergence proof of Oja's algorithm for streaming Principal Component Analysis, and should be useful more broadly for analyzing the convergence of stochastic gradient descent for certain classes of nonconvex optimization problems, including neural networks. This talk covers joint work with Amelia Henriksen, De Huang, Jon Niles-Weed, and Joel Tropp.
Rachel Ward 
Fri 11:50 a.m. – 12:00 p.m.

Live Q&A with Rachel Ward (Zoom)
(Q&A)

Mark Schmidt 
Fri 12:00 p.m. – 12:30 p.m.

Contributed talks in Session 3 (Zoom)
(Multiple talks)
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers:
Zhan Gao, "Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate"
Wenjie Li, "Variance Reduction on Adaptive Stochastic Mirror Descent"
Preetum Nakkiran, "Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems"
Denny Wu, "When Does Preconditioning Help or Hurt Generalization?"
Chengrun Yang, "TenIPS: Inverse Propensity Sampling for Tensor Completion"
You can find a video on the NeurIPS website where the speakers discuss their papers in detail.
Mark Schmidt, Zhan Gao, Wenjie Li, Preetum Nakkiran, Denny Wu, Chengrun Yang 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Variance Reduction on Adaptive Stochastic Mirror Descent, Wenjie Li
(Talk)
We study the application of the variance reduction technique to general adaptive stochastic mirror descent algorithms in nonsmooth nonconvex optimization problems. We prove that variance reduction helps to reduce the gradient complexity of most general stochastic mirror descent algorithms, so it works well with time-varying step sizes and adaptive optimization algorithms such as AdaGrad. We check the validity of our claims using experiments in deep learning.
Wenjie Li 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems, Preetum Nakkiran
(Talk)
The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li et al. (2019) recently proved that this behavior can exist in a simplified nonconvex neural-network setting. In this work, we show that this phenomenon can exist even for convex learning problems, in particular, linear regression in two dimensions. We give a toy convex problem where learning rate annealing (a large initial learning rate followed by a small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of the mismatch between the test and train loss landscapes and early stopping.
Preetum Nakkiran 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: When Does Preconditioning Help or Hurt Generalization?, Denny Wu
(Talk)
While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view of how the \textit{implicit bias} of optimizers affects the comparison of generalization properties. We provide an exact bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the ``shape'' of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned, NGD can achieve lower risk; conversely, GD generalizes better under clean labels, a well-specified model, or an aligned signal. Based on this analysis, we discuss approaches to manage the bias-variance tradeoff, and the benefit of interpolating between first- and second-order updates. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural networks, and observe robust trends matching our theoretical analysis.

Denny Wu 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: Incremental Greedy BFGS: An Incremental Quasi-Newton Method with Explicit Superlinear Rate, Zhan Gao
(Talk)
Finite-sum minimization problems, i.e., those whose objective may be written as the sum over a collection of instantaneous costs, are ubiquitous in modern machine learning and data science. Efficient numerical techniques for their solution must trade off per-step complexity against the number of steps required for convergence. Incremental Quasi-Newton (IQN) methods achieve a favorable balance of these competing attributes in the sense that their complexity is independent of the sample size, while their convergence rate can be faster than linear. To date, however, this local superlinear behavior is known only asymptotically. In this work, we put forth a new variant of IQN, specifically of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) type, that incorporates a greedy basis-vector selection step and admits a non-asymptotic, explicit local superlinear rate. To the best of our knowledge, this is the first time an explicit superlinear rate has been given for Quasi-Newton methods in the incremental setting.
Zhan Gao 
Fri 12:00 p.m. – 12:30 p.m.

Contributed Video: TenIPS: Inverse Propensity Sampling for Tensor Completion, Chengrun Yang
(Talk)
Tensors are widely used to model relationships among objects in a high-dimensional space. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry itself. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. Our algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a randomized linear algebra approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach.
Chengrun Yang 
Fri 12:30 p.m. – 1:30 p.m.

Break (gather.town)
(Break)

Fri 1:30 p.m. – 1:50 p.m.

Invited speaker: Fast convergence of stochastic subgradient method under interpolation, Michael Friedlander
(Talk)
This paper studies the behaviour of the stochastic subgradient descent (SSGD) method applied to overparameterized empirical-risk optimization models that exactly fit the training data. We prove that for models with composite structures often found in neural networks, the interpolation condition implies that the model is effectively smooth at all minimizers, and therefore that SSGD converges at rates normally achievable only for smooth convex problems. We also prove that the fast rates we derive are optimal for any subgradient method applied to convex problems where interpolation holds. This is joint work with Huang Fang and Zhenan Fan.
Michael Friedlander 
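As a toy illustration of the interpolation condition (not the composite models or the rates analyzed in the talk), consider hinge-loss classification on separable data: every per-sample loss can be driven exactly to zero, and around such a point the sampled losses are locally smooth (identically zero), so the constant-step stochastic subgradient method eventually stops updating. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Separable data with margin: keep samples whose score under w_true is
# comfortably away from the decision boundary.
d = 20
w_true = rng.standard_normal(d)
X = rng.standard_normal((400, d))
scores = X @ w_true
keep = np.abs(scores) > 2.0
X, y = X[keep][:100], np.sign(scores[keep][:100])

# SSGD on f(w) = mean_i max(0, 1 - y_i * x_i @ w): interpolation holds,
# since w = w_true / 2 already gives every sample hinge loss exactly 0.
w = np.zeros(d)
step = 0.5
for t in range(20000):
    i = rng.integers(len(y))
    if 1 - y[i] * (X[i] @ w) > 0:   # active piece: subgradient is -y_i * x_i
        w += step * y[i] * X[i]     # inactive piece: subgradient 0, no move

loss = np.mean(np.maximum(0.0, 1 - y * (X @ w)))
print(f"final hinge loss: {loss:.6f}")
```

With a constant step size the training loss is driven to (near) zero, perceptron-style, rather than stalling at the O(step) floor one expects for generic nonsmooth problems without interpolation.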
Fri 1:30 p.m.  1:35 p.m.

Intro to Invited Speaker 8
(Organizer intro)

Mark Schmidt 
Fri 1:50 p.m.  2:00 p.m.

Live Q&A with Michael Friedlander (Zoom)
(Q&A)

Mark Schmidt 
Fri 2:00 p.m.  2:50 p.m.

Poster Session 3 (gather.town)
(Poster session)
»
link »
Please join us in gather.town for all breaks and poster sessions. Click on "Open Link" to join gather.town. 
Denny Wu, Chengrun Yang, Tolga Ergen, Sanae Lotfi, Charles Guille-Escuret, Boris Ginsburg, Hanbaek Lyu, Cong Xie, David Newton, Debraj Basu, Yewen Wang, James Lucas, Maojia Li, Lijun Ding, Jose Javier Gonzalez Ortiz, Reyhane Askari Hemmat, Zhiqi Bu, Neal Lawton, Kiran Thekumparampil, Jiaming Liang, Lindon Roberts, Jingyi Zhu, Dongruo Zhou

Fri 2:50 p.m.  3:00 p.m.

Welcome remarks to Session 4
(Opening remarks)

Quanquan Gu 
Fri 3:00 p.m.  3:20 p.m.

Invited speaker: Online nonnegative matrix factorization for Markovian and other real data, Deanna Needell and Hanbaek Lyu
(Talk)
»
SlidesLive Video » Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a nonconvex generalization of the well-known OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts "network dictionary patches" from a given network in an online manner, encoding the main features of the network. We demonstrate this technique on real-world data and discuss recent extensions and variations. 
Hanbaek Lyu, Deanna Needell 
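The online factorization loop can be sketched in a few lines. This is a hedged illustration of generic online NMF on an i.i.d. synthetic stream, following the sufficient-statistics scheme of Mairal et al., not the Markovian setting or the Network Dictionary Learning framework of the talk; `nnls_pg` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)

d, r, T = 30, 5, 500
W_true = rng.random((d, r))            # nonnegative ground-truth dictionary

def nnls_pg(W, x, iters=50):
    # Nonnegative code for one sample by projected gradient on ||x - W h||^2.
    h = np.zeros(W.shape[1])
    lr = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-12)
    for _ in range(iters):
        h = np.maximum(0.0, h - lr * (W.T @ (W @ h - x)))
    return h

# Online NMF: stream samples one at a time and aggregate the sufficient
# statistics A = sum h h^T and B = sum x h^T, then refresh the dictionary.
W = np.maximum(rng.random((d, r)), 1e-3)
A = np.zeros((r, r))
B = np.zeros((d, r))
for t in range(T):
    x = W_true @ rng.random(r)         # one streamed data vector
    h = nnls_pg(W, x)
    A += np.outer(h, h)
    B += np.outer(x, h)
    # Block-coordinate dictionary update with nonnegativity projection.
    for j in range(r):
        if A[j, j] > 1e-12:
            W[:, j] = np.maximum(0.0, W[:, j] + (B[:, j] - W @ A[:, j]) / A[j, j])

x = W_true @ rng.random(r)
h = nnls_pg(W, x)
err = np.linalg.norm(x - W @ h) / np.linalg.norm(x)
print(f"relative reconstruction error on a fresh sample: {err:.3f}")
```

The talk's result concerns exactly this kind of loop when the streamed `x` comes from a Markov chain rather than an i.i.d. source.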
Fri 3:20 p.m.  3:30 p.m.

Live Q&A with Deanna Needell and Hanbaek Lyu (Zoom)
(Q&A)

Quanquan Gu 
Fri 3:30 p.m.  4:00 p.m.

Contributed talks in Session 4 (Zoom)
(Multiple talks)
»
Join us to hear some new, exciting work at the intersection of optimization and ML. Come and ask questions and join the discussion. Speakers: Tolga Ergen, "Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time"; Charles Guille-Escuret, "A Study of Condition Numbers for First-Order Optimization"; Lewis Liu, "Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets"; Sanae Lotfi, "Stochastic Damped L-BFGS with controlled norm of the Hessian approximation"; Dongruo Zhou, "On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization". You can find a video on the NeurIPS website where the speakers discuss their papers in detail. 
Quanquan Gu, Sanae Lotfi, Charles Guille-Escuret, Tolga Ergen, Dongruo Zhou 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Stochastic Damped L-BFGS with controlled norm of the Hessian approximation, Sanae Lotfi
(Talk)
»
link »
SlidesLive Video » We propose a new stochastic variance-reduced damped L-BFGS algorithm, in which we leverage estimates of bounds on the largest and smallest eigenvalues of the Hessian approximation to balance its quality and conditioning. Our algorithm, VARCHEN, draws from previous work that proposed a novel stochastic damped L-BFGS algorithm called SdLBFGS. We establish almost-sure convergence to a stationary point and a complexity bound. We empirically demonstrate that VARCHEN is more robust than SdLBFGS-VR and SVRG on a modified DavidNet problem, a highly nonconvex and ill-conditioned problem that arises in the context of deep learning, while their performance is comparable on a logistic regression problem and a nonconvex support-vector machine problem. 
Sanae Lotfi 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: A Study of Condition Numbers for First-Order Optimization, Charles Guille-Escuret
(Talk)
»
link »
SlidesLive Video » In this work we introduce a new framework for the theoretical study of the convergence and tuning of first-order optimization algorithms (FOA). The study of such algorithms typically requires assumptions on the objective functions: the most popular ones are probably smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called the *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics, which we prove to be continuous. We describe their mutual relations and provide the guaranteed convergence rates of the Gradient Descent algorithm when tuned accordingly. 
Charles Guille-Escuret 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Affine-Invariant Analysis of Frank-Wolfe on Strongly Convex Sets, Lewis Liu
(Talk)
»
link »
SlidesLive Video »
When the constraint set $\mathcal{C}$ is strongly convex, the Frank-Wolfe algorithm, which is affine covariant, enjoys accelerated convergence rates. In contrast, existing results rely on norm-dependent assumptions, usually incurring non-affine-invariant bounds. In this work, we introduce new structural assumptions on the problem and derive an affine-invariant, norm-independent analysis of Frank-Wolfe. Based on our analysis, we propose an affine-invariant backtracking line-search. Interestingly, we show that typical backtracking line-searches using smoothness of the objective function converge to an affine-invariant step size, despite using affine-dependent norms in the computation of the step size.
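For intuition, here is a hedged sketch of vanilla Frank-Wolfe with exact line search on the simplest strongly convex set, the unit Euclidean ball, where the linear minimization oracle (LMO) is closed-form. It is not the paper's affine-invariant analysis or backtracking scheme, just the base algorithm it studies.

```python
import numpy as np

rng = np.random.default_rng(3)

# Frank-Wolfe on a strongly convex set: minimize f(x) = ||x - b||^2 over the
# unit Euclidean ball, whose LMO is closed-form:
# argmin_{||s|| <= 1} <g, s> = -g / ||g||.
d = 10
b = rng.standard_normal(d)
b *= 2.0 / np.linalg.norm(b)           # put the optimum on the boundary

x = np.zeros(d)
gaps = []
for t in range(200):
    g = 2 * (x - b)                    # gradient of f at x
    s = -g / np.linalg.norm(g)         # LMO over the unit ball
    gaps.append(g @ (x - s))           # Frank-Wolfe duality gap
    if gaps[-1] < 1e-12:               # converged to numerical precision
        break
    dvec = s - x
    # Exact line search for a quadratic objective along x + gamma * dvec.
    gamma = np.clip(-(g @ dvec) / (2 * dvec @ dvec), 0.0, 1.0)
    x = x + gamma * dvec

print(f"final duality gap: {gaps[-1]:.2e}")
```

On strongly convex sets with the optimum on the boundary, the duality gap decays linearly, which is the accelerated regime the abstract refers to.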


Fri 3:30 p.m.  4:00 p.m.

Contributed Video: Convex Programs for Global Optimization of Convolutional Neural Networks in Polynomial-Time, Tolga Ergen
(Talk)
»
link »
SlidesLive Video »
We study the training of Convolutional Neural Networks (CNNs) with ReLU activations and introduce exact convex optimization formulations with polynomial complexity with respect to the number of data samples, the number of neurons, and the data dimension. In particular, we develop a convex analytic framework utilizing semi-infinite duality to obtain equivalent convex optimization problems for two-layer CNNs, where the convex problems are regularized by the sum of $\ell_2$ norms of the variables.

Tolga Ergen 
Fri 3:30 p.m.  4:00 p.m.

Contributed Video: On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, Dongruo Zhou
(Talk)
»
link »
SlidesLive Video » Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods, including AMSGrad, RMSProp, and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods converge in expectation to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension, and is strictly faster than that of stochastic gradient descent (SGD) when the stochastic gradients are sparse. To the best of our knowledge, this is the first result showing the advantage of adaptive gradient methods over SGD in the nonconvex setting. In addition, we prove high-probability bounds on the convergence rates of AMSGrad, RMSProp, and AdaGrad, which had not been established before. Our analyses shed light on the mechanism behind adaptive gradient methods in optimizing nonconvex objectives. 
Dongruo Zhou 
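For reference, AMSGrad (one of the methods analyzed above) differs from Adam only in keeping a running maximum of the second-moment estimate. A minimal sketch on a smooth nonconvex toy function; the test function and hyperparameters are illustrative, not from the paper.

```python
import numpy as np

def amsgrad(grad, x0, lr=0.05, b1=0.9, b2=0.99, iters=3000):
    # AMSGrad: Adam with a running *maximum* of the second-moment estimate,
    # which makes the per-coordinate effective step size non-increasing.
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # first-moment estimate
    v = np.zeros_like(x)       # second-moment estimate
    v_hat = np.zeros_like(x)   # elementwise running max of v
    for _ in range(iters):
        g = grad(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        v_hat = np.maximum(v_hat, v)
        x = x - lr * m / (np.sqrt(v_hat) + 1e-8)
    return x

# Smooth nonconvex test function f(x) = sum_i (x_i^2 - cos(2 x_i)),
# with gradient 2x + 2 sin(2x); its only stationary point is x = 0.
grad = lambda x: 2 * x + 2 * np.sin(2 * x)
x = amsgrad(grad, np.full(5, 2.0))
print(f"gradient norm at the returned point: {np.linalg.norm(grad(x)):.2e}")
```

The run ends with a near-zero gradient norm, matching the kind of first-order stationarity guarantee the abstract describes.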
Fri 4:00 p.m.  4:05 p.m.

Closing remarks

Quanquan Gu, Courtney Paquette, Mark Schmidt, Sebastian Stich, Martin Takac 
Author Information
Courtney Paquette (Lehigh University)
Mark Schmidt (University of British Columbia)
Sebastian Stich (EPFL)
Dr. [Sebastian U. Stich](https://sstich.ch/) is a postdoctoral researcher in machine learning at EPFL (Lausanne, Switzerland). Research interests: methods for machine learning and statistics, at the interface of theory and practice; collaborative learning (distributed, federated, and decentralized methods); optimization for machine learning (adaptive stochastic methods and generalization performance).
Quanquan Gu (UCLA)
Martin Takac (Lehigh University)
More from the Same Authors

2021 Workshop: OPT 2021: Optimization for Machine Learning »
Courtney Paquette · Quanquan Gu · Oliver Hinder · Katya Scheinberg · Sebastian Stich · Martin Takac 
2020 Poster: A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks »
Zixiang Chen · Yuan Cao · Quanquan Gu · Tong Zhang 
2020 Poster: Regret Bounds without Lipschitz Continuity: Online Learning with Relative-Lipschitz Losses »
Yihan Zhou · Victor Sanches Portella · Mark Schmidt · Nicholas Harvey 
2020 Poster: Ensemble Distillation for Robust Model Fusion in Federated Learning »
Tao Lin · Lingjing Kong · Sebastian Stich · Martin Jaggi 
2020 Poster: Agnostic Learning of a Single Neuron with Gradient Descent »
Spencer Frei · Yuan Cao · Quanquan Gu 
2020 Poster: A Finite-Time Analysis of Two Time-Scale Actor-Critic Methods »
Yue Frank Wu · Weitong ZHANG · Pan Xu · Quanquan Gu 
2019 Poster: Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks »
Spencer Frei · Yuan Cao · Quanquan Gu 
2019 Poster: Layer-Dependent Importance Sampling for Training Deep and Large Graph Convolutional Networks »
Difan Zou · Ziniu Hu · Yewen Wang · Song Jiang · Yizhou Sun · Quanquan Gu 
2019 Poster: Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates »
Sharan Vaswani · Aaron Mishkin · Issam Laradji · Mark Schmidt · Gauthier Gidel · Simon LacosteJulien 
2019 Poster: Stochastic Gradient Hamiltonian Monte Carlo Methods with Recursive Variance Reduction »
Difan Zou · Pan Xu · Quanquan Gu 
2019 Poster: Tight Sample Complexity of Learning One-hidden-layer Convolutional Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Poster: An Improved Analysis of Training Overparameterized Deep Neural Networks »
Difan Zou · Quanquan Gu 
2019 Poster: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2019 Spotlight: Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks »
Yuan Cao · Quanquan Gu 
2018 Poster: Third-order Smoothness Helps: Faster Stochastic Optimization Algorithms for Finding Local Minima »
Yaodong Yu · Pan Xu · Quanquan Gu 
2018 Poster: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient »
Aaron Mishkin · Frederik Kunstner · Didrik Nielsen · Mark Schmidt · Mohammad Emtiyaz Khan 
2018 Spotlight: Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization »
Pan Xu · Jinghui Chen · Difan Zou · Quanquan Gu 
2018 Poster: Accelerated Stochastic Matrix Inversion: General Theory and Speeding up BFGS Rules for Faster Second-Order Optimization »
Robert Gower · Filip Hanzely · Peter Richtarik · Sebastian Stich 
2018 Poster: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Reinforcement Learning for Solving the Vehicle Routing Problem »
MohammadReza Nazari · Afshin Oroojlooy · Lawrence Snyder · Martin Takac 
2018 Spotlight: Stochastic Nested Variance Reduced Gradient Descent for Nonconvex Optimization »
Dongruo Zhou · Pan Xu · Quanquan Gu 
2018 Poster: Sparsified SGD with Memory »
Sebastian Stich · Jean-Baptiste Cordonnier · Martin Jaggi 
2018 Poster: Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization »
Bargav Jayaraman · Lingxiao Wang · David Evans · Quanquan Gu 
2017 Poster: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Spotlight: Safe Adaptive Importance Sampling »
Sebastian Stich · Anant Raj · Martin Jaggi 
2017 Poster: Speeding Up Latent Variable Gaussian Graphical Model Estimation via Nonconvex Optimization »
Pan Xu · Jian Ma · Quanquan Gu 
2016 Poster: Semiparametric Differential Graph Models »
Pan Xu · Quanquan Gu 
2016 Poster: A Multi-Batch L-BFGS Method for Machine Learning »
Albert Berahas · Jorge Nocedal · Martin Takac 
2015 Poster: High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality »
Zhaoran Wang · Quanquan Gu · Yang Ning · Han Liu 
2015 Poster: Stop Wasting My Gradients: Practical SVRG »
Reza Babanezhad Harikandeh · Mohamed Osama Ahmed · Alim Virani · Mark Schmidt · Jakub Konečný · Scott Sallinen 
2014 Poster: Communication-Efficient Distributed Dual Coordinate Ascent »
Martin Jaggi · Virginia Smith · Martin Takac · Jonathan Terhorst · Sanjay Krishnan · Thomas Hofmann · Michael Jordan 
2014 Poster: Sparse PCA with Oracle Property »
Quanquan Gu · Zhaoran Wang · Han Liu 
2014 Poster: Robust Tensor Decomposition with Gross Corruption »
Quanquan Gu · Huan Gui · Jiawei Han 
2012 Poster: Selective Labeling via Error Bound Minimization »
Quanquan Gu · Tong Zhang · Chris Ding · Jiawei Han