Timezone: »
Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start by two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet do not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows under simple geometric conditions that, any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.
Author Information
Chulhee Yun (MIT)
Suvrit Sra (MIT)
Suvrit Sra is a faculty member within the EECS department at MIT, where he is also a core faculty member of IDSS, LIDS, MIT-ML Group, as well as the statistics and data science center. His research spans topics in optimization, matrix theory, differential geometry, and probability theory, which he connects with machine learning --- a key focus of his research is on the theme "Optimization for Machine Learning” (http://opt-ml.org)
Ali Jadbabaie (MIT)
More from the Same Authors
-
2023 Poster: The Curious Role of Normalization in Sharpness-Aware Minimization »
Yan Dai · Kwangjun Ahn · Suvrit Sra -
2023 Poster: Convergence of Adam under Relaxed Assumptions »
Haochuan Li · Ali Jadbabaie · Alexander Rakhlin -
2023 Poster: Transformers learn to implement preconditioned gradient descent for in-context learning »
Kwangjun Ahn · Xiang Cheng · Hadi Daneshmand · Suvrit Sra -
2023 Poster: Demystifying Oversmoothing in Attention-Based Graph Neural Networks »
Xinyi Wu · Amir Ajorlou · Zihui Wu · Ali Jadbabaie -
2023 Poster: Beyond Lipschitz Smoothness: A New Approach to Convex and Non-Convex Optimization »
Haochuan Li · Jian Qian · Yi Tian · Ali Jadbabaie · Alexander Rakhlin -
2022 Poster: CCCP is Frank-Wolfe in disguise »
Alp Yurtsever · Suvrit Sra -
2022 Poster: Efficient Sampling on Riemannian Manifolds via Langevin MCMC »
Xiang Cheng · Jingzhao Zhang · Suvrit Sra -
2021 Poster: Can contrastive learning avoid shortcut solutions? »
Joshua Robinson · Li Sun · Ke Yu · Kayhan Batmanghelich · Stefanie Jegelka · Suvrit Sra -
2021 Poster: Complexity Lower Bounds for Nonconvex-Strongly-Concave Min-Max Optimization »
Haochuan Li · Yi Tian · Jingzhao Zhang · Ali Jadbabaie -
2021 Poster: Three Operator Splitting with Subgradients, Stochastic Gradients, and Adaptive Learning Rates »
Alp Yurtsever · Alex Gu · Suvrit Sra -
2020 : Poster Session 2 (gather.town) »
Sharan Vaswani · Nicolas Loizou · Wenjie Li · Preetum Nakkiran · Zhan Gao · Sina Baghal · Jingfeng Wu · Roozbeh Yousefzadeh · Jinyi Wang · Jing Wang · Cong Xie · Anastasia Borovykh · Stanislaw Jastrzebski · Soham Dan · Yiliang Zhang · Mark Tuddenham · Sarath Pattathil · Ievgen Redko · Jeremy Cohen · Yasaman Esfandiari · Zhanhong Jiang · Mostafa ElAraby · Chulhee Yun · Michael Psenka · Robert Gower · Xiaoyu Wang -
2020 : Invited speaker: SGD without replacement: optimal rate analysis and more, Suvrit Sra »
Suvrit Sra -
2020 Poster: SGD with shuffling: optimal rates without component convexity and large epoch requirements »
Kwangjun Ahn · Chulhee Yun · Suvrit Sra -
2020 Spotlight: SGD with shuffling: optimal rates without component convexity and large epoch requirements »
Kwangjun Ahn · Chulhee Yun · Suvrit Sra -
2020 Poster: Robust Federated Learning: The Case of Affine Distribution Shifts »
Amirhossein Reisizadeh · Farzan Farnia · Ramtin Pedarsani · Ali Jadbabaie -
2020 Poster: Estimation of Skill Distribution from a Tournament »
Ali Jadbabaie · Anuran Makur · Devavrat Shah -
2020 Spotlight: Estimation of Skill Distribution from a Tournament »
Ali Jadbabaie · Anuran Makur · Devavrat Shah -
2020 Poster: Why are Adaptive Methods Good for Attention Models? »
Jingzhao Zhang · Sai Praneeth Karimireddy · Andreas Veit · Seungyeon Kim · Sashank Reddi · Sanjiv Kumar · Suvrit Sra -
2020 Poster: O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers »
Chulhee Yun · Yin-Wen Chang · Srinadh Bhojanapalli · Ankit Singh Rawat · Sashank Reddi · Sanjiv Kumar -
2020 Poster: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes »
Yi Tian · Jian Qian · Suvrit Sra -
2020 Spotlight: Towards Minimax Optimal Reinforcement Learning in Factored Markov Decision Processes »
Yi Tian · Jian Qian · Suvrit Sra -
2019 : Break / Poster Session 1 »
Antonia Marcu · Yao-Yuan Yang · Pascale Gourdeau · Chen Zhu · Thodoris Lykouris · Jianfeng Chi · Mark Kozdoba · Arjun Nitin Bhagoji · Xiaoxia Wu · Jay Nandy · Michael T Smith · Bingyang Wen · Yuege Xie · Konstantinos Pitas · Suprosanna Shit · Maksym Andriushchenko · Dingli Yu · Gaël Letarte · Misha Khodak · Hussein Mozannar · Chara Podimata · James Foulds · Yizhen Wang · Huishuai Zhang · Ondrej Kuzelka · Alexander Levine · Nan Lu · Zakaria Mhammedi · Paul Viallard · Diana Cai · Lovedeep Gondara · James Lucas · Yasaman Mahdaviyeh · Aristide Baratin · Rishi Bommasani · Alessandro Barp · Andrew Ilyas · Kaiwen Wu · Jens Behrmann · Omar Rivasplata · Amir Nazemi · Aditi Raghunathan · Will Stephenson · Sahil Singla · Akhil Gupta · YooJung Choi · Yannic Kilcher · Clare Lyle · Edoardo Manino · Andrew Bennett · Zhi Xu · Niladri Chatterji · Emre Barut · Flavien Prost · Rodrigo Toro Icarte · Arno Blaas · Chulhee Yun · Sahin Lale · YiDing Jiang · Tharun Kumar Reddy Medini · Ashkan Rezaei · Alexander Meinke · Stephen Mell · Gary Kazantsev · Shivam Garg · Aradhana Sinha · Vishnu Lokhande · Geovani Rizk · Han Zhao · Aditya Kumar Akash · Jikai Hou · Ali Ghodsi · Matthias Hein · Tyler Sypherd · Yichen Yang · Anastasia Pentina · Pierre Gillot · Antoine Ledent · Guy Gur-Ari · Noah MacAulay · Tianzong Zhang -
2019 Poster: Flexible Modeling of Diversity with Strongly Log-Concave Distributions »
Joshua Robinson · Suvrit Sra · Stefanie Jegelka -
2019 Poster: Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity »
Chulhee Yun · Suvrit Sra · Ali Jadbabaie -
2019 Spotlight: Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity »
Chulhee Yun · Suvrit Sra · Ali Jadbabaie -
2018 Poster: Direct Runge-Kutta Discretization Achieves Acceleration »
Jingzhao Zhang · Aryan Mokhtari · Suvrit Sra · Ali Jadbabaie -
2018 Spotlight: Direct Runge-Kutta Discretization Achieves Acceleration »
Jingzhao Zhang · Aryan Mokhtari · Suvrit Sra · Ali Jadbabaie -
2018 Poster: Escaping Saddle Points in Constrained Optimization »
Aryan Mokhtari · Asuman Ozdaglar · Ali Jadbabaie -
2018 Spotlight: Escaping Saddle Points in Constrained Optimization »
Aryan Mokhtari · Asuman Ozdaglar · Ali Jadbabaie -
2018 Poster: Exponentiated Strongly Rayleigh Distributions »
Zelda Mariet · Suvrit Sra · Stefanie Jegelka -
2018 Tutorial: Negative Dependence, Stable Polynomials, and All That »
Suvrit Sra · Stefanie Jegelka -
2017 : Poster Session 1 and Lunch »
Sumanth Dathathri · Akshay Rangamani · Prakhar Sharma · Aruni RoyChowdhury · Madhu Advani · William Guss · Chulhee Yun · Corentin Hardy · Michele Alberti · Devendra Sachan · Andreas Veit · Takashi Shinozaki · Peter Chin -
2017 Workshop: OPT 2017: Optimization for Machine Learning »
Suvrit Sra · Sashank J. Reddi · Alekh Agarwal · Benjamin Recht -
2017 Poster: Elementary Symmetric Polynomials for Optimal Experimental Design »
Zelda Mariet · Suvrit Sra -
2017 Poster: Polynomial time algorithms for dual volume sampling »
Chengtao Li · Stefanie Jegelka · Suvrit Sra -
2016 Workshop: OPT 2016: Optimization for Machine Learning »
Suvrit Sra · Francis Bach · Sashank J. Reddi · Niao He -
2016 : Taming non-convexity via geometry »
Suvrit Sra -
2016 Poster: Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling »
Chengtao Li · Suvrit Sra · Stefanie Jegelka -
2016 Poster: Kronecker Determinantal Point Processes »
Zelda Mariet · Suvrit Sra -
2016 Poster: Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization »
Sashank J. Reddi · Suvrit Sra · Barnabas Poczos · Alexander Smola -
2016 Poster: Riemannian SVRG: Fast Stochastic Optimization on Riemannian Manifolds »
Hongyi Zhang · Sashank J. Reddi · Suvrit Sra -
2016 Tutorial: Large-Scale Optimization: Beyond Stochastic Gradient Descent and Convexity »
Suvrit Sra · Francis Bach -
2015 Workshop: Optimization for Machine Learning (OPT2015) »
Suvrit Sra · Alekh Agarwal · Leon Bottou · Sashank J. Reddi -
2015 Poster: Matrix Manifold Optimization for Gaussian Mixtures »
Reshad Hosseini · Suvrit Sra -
2015 Poster: On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants »
Sashank J. Reddi · Ahmed Hefny · Suvrit Sra · Barnabas Poczos · Alexander Smola -
2014 Workshop: OPT2014: Optimization for Machine Learning »
Zaid Harchaoui · Suvrit Sra · Alekh Agarwal · Martin Jaggi · Miro Dudik · Aaditya Ramdas · Jean Lasserre · Yoshua Bengio · Amir Beck -
2014 Poster: Efficient Structured Matrix Rank Minimization »
Adams Wei Yu · Wanli Ma · Yaoliang Yu · Jaime Carbonell · Suvrit Sra -
2013 Workshop: OPT2013: Optimization for Machine Learning »
Suvrit Sra · Alekh Agarwal -
2013 Poster: Geometric optimisation on positive definite matrices for elliptically contoured distributions »
Suvrit Sra · Reshad Hosseini -
2013 Poster: Reflection methods for user-friendly submodular optimization »
Stefanie Jegelka · Francis Bach · Suvrit Sra -
2012 Workshop: Optimization for Machine Learning »
Suvrit Sra · Alekh Agarwal -
2012 Poster: A new metric on the manifold of kernel matrices with application to matrix geometric means »
Suvrit Sra -
2012 Poster: Scalable nonconvex inexact proximal splitting »
Suvrit Sra -
2011 Workshop: Optimization for Machine Learning »
Suvrit Sra · Stephen Wright · Sebastian Nowozin -
2010 Workshop: Numerical Mathematics Challenges in Machine Learning »
Matthias Seeger · Suvrit Sra -
2010 Workshop: Optimization for Machine Learning »
Suvrit Sra · Sebastian Nowozin · Stephen Wright -
2009 Workshop: Optimization for Machine Learning »
Sebastian Nowozin · Suvrit Sra · S.V.N Vishwanthan · Stephen Wright -
2008 Workshop: Optimization for Machine Learning »
Suvrit Sra · Sebastian Nowozin · Vishwanathan S V N