Workshop
Heavy Tails in ML: Structure, Stability, Dynamics
Mert Gurbuzbalaban · Stefanie Jegelka · Michael Mahoney · Umut Simsekli
Room R02-R05 (level 2)
Heavy tails and chaotic behavior arise naturally in many areas of ML. This workshop aims to understand how they emerge and how they affect the properties of ML methods.
Schedule
Fri 7:00 a.m. - 7:01 a.m. | Opening Remarks (Presentation)
Fri 7:00 a.m. - 8:00 a.m. | Keynote: Adam Wierman (Presentation)
Fri 8:00 a.m. - 8:30 a.m. | Coffee break (Break)
Fri 8:30 a.m. - 9:10 a.m. | Invited talk: Liam Hodgkinson (Presentation)
Fri 9:10 a.m. - 9:35 a.m. | Contributed talk: Vivien Cabannes (Presentation)
Fri 9:35 a.m. - 10:00 a.m. | Contributed talk: Dominique Perrault-Joncas (Presentation)
Fri 10:00 a.m. - 12:00 p.m. | Lunch break (Break)
Fri 12:00 p.m. - 12:40 p.m. | Invited talk: Nisha Chandramoorthy (Presentation)
Fri 12:40 p.m. - 1:05 p.m. | Contributed talk: Jeremy Cohen (Presentation)
Fri 1:05 p.m. - 1:30 p.m. | Coffee break (Break)
Fri 1:30 p.m. - 2:10 p.m. | Invited talk: Charles Martin (Presentation)
Fri 2:10 p.m. - 3:30 p.m. | Poster session (Poster presentation)
Fri 3:29 p.m. - 3:30 p.m. | Closing remarks (Presentation)
Posters

Semi-Implicit Neural Ordinary Differential Equations for Learning Chaotic Systems (Poster)
Classical neural ordinary differential equations (ODEs) trained using explicit methods are intrinsically constrained by stability, severely affecting their efficiency and robustness in learning complex spatiotemporal dynamics, particularly those displaying chaotic behavior. In this work we propose a semi-implicit neural ODE approach that capitalizes on the partitionable structure of the underlying dynamics. In our method the neural ODE is partitioned into a linear part treated implicitly for enhanced stability and a nonlinear part treated explicitly. We apply this approach to learn chaotic trajectories of the Kuramoto–Sivashinsky equation. Our results demonstrate that our approach significantly outperforms existing approaches for coarse-resolution data and remains efficient for fine-resolution data where existing techniques become intractable.
Hong Zhang · Ying Liu · Romit Maulik
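The implicit/explicit split described above can be illustrated with a minimal IMEX Euler step; the function names and the toy linear/nonlinear operators below are illustrative assumptions, not the authors' implementation (which learns the partitioned dynamics with neural networks).

```python
import numpy as np

def imex_euler_step(u, dt, L, nonlinear):
    """One semi-implicit (IMEX) Euler step for du/dt = L @ u + N(u).

    The linear part L is treated implicitly (backward Euler) and the
    nonlinear part N explicitly (forward Euler):
        (I - dt*L) u_{n+1} = u_n + dt * N(u_n)
    """
    rhs = u + dt * nonlinear(u)
    return np.linalg.solve(np.eye(len(u)) - dt * L, rhs)

# Toy usage: a stiff linear decay plus a mild cubic nonlinearity.
L = np.diag([-50.0, -1.0])          # stiff linear operator
nonlinear = lambda u: -u**3         # explicit nonlinear term
u = np.array([1.0, 1.0])
for _ in range(100):
    u = imex_euler_step(u, dt=0.05, L=L, nonlinear=nonlinear)
print(u)
```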
From Mutual Information to Expected Dynamics: New Generalization Bounds for Heavy-Tailed SGD (Poster)
Understanding the generalization abilities of modern machine learning algorithms has been a major research topic over the past decades. In recent years, the learning dynamics of Stochastic Gradient Descent (SGD) have been related to heavy-tailed dynamics. This has been successfully applied to generalization theory by exploiting the fractal properties of those dynamics. However, the derived bounds depend on mutual information (decoupling) terms that are beyond the reach of computability. In this work, we prove generalization bounds over the trajectory of a class of heavy-tailed dynamics, without those mutual information terms. Instead, we introduce a geometric decoupling term by comparing the learning dynamics (depending on the empirical risk) with an expected one (depending on the population risk). We further upper-bound this geometric term using techniques from the heavy-tailed and fractal literature, making it fully computable. Moreover, as an attempt to tighten the bounds, we propose a PAC-Bayesian setting based on perturbed dynamics, in which the same geometric term plays a crucial role and can still be bounded using the techniques described above.
Benjamin, Bernard, Louis DUPUIS · Paul Viallard
Instance-Aware Repeat Factor Sampling for Long-Tailed Object Detection (Poster)
We propose an embarrassingly simple method, instance-aware repeat factor sampling (IRFS), to address the problem of imbalanced data in long-tailed object detection. Imbalanced datasets in real-world object detection often suffer from a large disparity in the number of instances for each class. To improve the generalization performance of object detection models on rare classes, various data sampling techniques have been proposed. Repeat factor sampling (RFS) has shown promise due to its simplicity and effectiveness. Despite its efficiency, RFS completely neglects the instance counts and relies solely on the image count during the re-sampling process. However, instance counts may vary immensely across classes with similar image counts. Such variation highlights the importance of both image and instance counts for addressing long-tail distributions. Thus, we propose IRFS, which unifies instance and image counts in the re-sampling process so that it is aware of different perspectives of the imbalance in long-tailed datasets. Our method shows promising results on the challenging LVIS v1.0 benchmark dataset over various architectures and backbones, demonstrating its effectiveness in improving the performance of object detection models on rare classes, with a relative $+50\%$ average precision (AP) improvement over the RFS counterpart. IRFS can serve as a strong baseline and can be easily incorporated into existing long-tailed frameworks.
Burhan Yaman · Tanvir Mahmud · Chun-Hao Liu
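For context, repeat factor sampling derives a per-class repeat factor from image frequencies; the sketch below shows that computation together with one plausible way to make it instance-aware by also using instance frequencies. The geometric-mean combination and the threshold t are illustrative assumptions, not necessarily the exact IRFS rule.

```python
import numpy as np

def repeat_factors(image_freq, instance_freq, t=0.001):
    """Per-class repeat factors in the spirit of RFS / IRFS.

    image_freq[c]   : fraction of training images containing class c
    instance_freq[c]: fraction of all instances belonging to class c
    RFS uses only the image frequency: r_img(c) = max(1, sqrt(t / f_img(c))).
    As a rough illustration of instance awareness, an instance-based factor
    is computed the same way and the two are combined by a geometric mean.
    """
    r_img = np.maximum(1.0, np.sqrt(t / np.asarray(image_freq)))
    r_ins = np.maximum(1.0, np.sqrt(t / np.asarray(instance_freq)))
    return np.sqrt(r_img * r_ins)

# Example: a frequent class vs. a rare class whose instances are also rare.
print(repeat_factors(image_freq=[0.5, 0.0001], instance_freq=[0.7, 0.00002]))
```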
Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data (Poster)
A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized controlled trials (RCTs) is the lack of ground truth (or a validation set) against which to test their performance. In this paper, we propose a novel cross-validation-like methodology to address this challenge. The key insight of our procedure is that the noisy (but unbiased) difference-of-means estimate can be used as a ground-truth "label" on a portion of the RCT, to test the performance of an estimator trained on the other portion. We combine this insight with an aggregation scheme, which borrows statistical strength across a large collection of RCTs, to present an end-to-end methodology for judging an estimator's ability to recover the underlying treatment effect. We evaluate our methodology across 699 RCTs implemented in the Amazon supply chain. In this heavy-tailed setting, our methodology suggests that procedures that aggressively downweight or truncate large values, while introducing bias, lower the variance enough to ensure that the treatment effect is more accurately estimated.
Nilesh Tripuraneni · Dominique Perrault-Joncas · Dhruv Madeka · Dean Foster · Michael Jordan
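A stripped-down sketch of the core idea: split a single RCT, fit an estimator on one half, and score it against the noisy-but-unbiased difference-of-means "label" from the held-out half. The splitting scheme, the estimator interface, and the winsorized baseline are illustrative assumptions, and the paper's cross-RCT aggregation step is omitted.

```python
import numpy as np

def dm_estimate(y, treated):
    """Unbiased difference-of-means estimate of the treatment effect."""
    return y[treated].mean() - y[~treated].mean()

def winsorized_dm(y, treated):
    """A biased but lower-variance estimator that truncates large outcomes."""
    lo, hi = np.quantile(y, [0.01, 0.99])
    yc = np.clip(y, lo, hi)
    return yc[treated].mean() - yc[~treated].mean()

def score_estimator(estimator, y, treated, n_splits=200, seed=0):
    """Cross-validation-like score on a single RCT: fit on one half, compare
    against the held-out difference-of-means label (squared error)."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        fit, hold = idx[: len(y) // 2], idx[len(y) // 2:]
        errs.append((estimator(y[fit], treated[fit]) - dm_estimate(y[hold], treated[hold])) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(1)
n = 4000
treated = rng.random(n) < 0.5
y = rng.standard_t(df=2, size=n) + 0.3 * treated   # heavy-tailed outcomes, true TE = 0.3
print(score_estimator(dm_estimate, y, treated), score_estimator(winsorized_dm, y, treated))
```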
The Effects of Ensembling on Long-Tailed Data (Poster)
Deep ensembles are a popular approach to improve over single-model performance (Lakshminarayanan et al. 2017), either by averaging logits (Hinton et al. 2015, Webb et al. 2020, Gontijo-Lopes et al. 2022) or probabilities of multiple models (Dietterich 2000, Lakshminarayanan et al. 2017, Kumar et al. 2022). Recent theoretical work has shown that logit and probability ensembles have different benefits (Gupta et al. 2022, Wood et al. 2023), but to our knowledge these ensembling approaches have not been compared systematically for balanced vs. imbalanced data. In this work, we show that for balanced datasets, there is no significant difference between logit and probability ensembles in terms of accuracy and ranked calibration. However, we show that on long-tailed datasets, there are gains from logit ensembling when combined with imbalance-bias-reduction losses. In turn, our results suggest that there are benefits to be gained from loss-aware ensembles when dealing with long-tailed data.
Estefany Kelly Buchanan · Geoff Pleiss · John Cunningham
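The two ensembling schemes being compared, logit averaging versus probability averaging, can be written in a few lines; this is a generic sketch, not the paper's code.

```python
import numpy as np

def ensemble_predict(member_logits, mode="logit"):
    """Combine ensemble members in logit space or in probability space.

    member_logits: array of shape (n_members, n_samples, n_classes).
    "logit" averages raw logits before the softmax; "prob" averages the
    per-member softmax probabilities.
    """
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    if mode == "logit":
        return softmax(member_logits.mean(axis=0))
    return softmax(member_logits).mean(axis=0)

logits = np.random.randn(5, 3, 10)   # 5 members, 3 samples, 10 classes
print(ensemble_predict(logits, "logit").argmax(-1),
      ensemble_predict(logits, "prob").argmax(-1))
```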
High-dimensional robust regression under heavy-tailed data: Asymptotics and Universality (Poster)
We investigate the high-dimensional properties of robust regression estimators in the presence of heavy-tailed contamination of both the covariates and response functions. In particular, we provide a sharp asymptotic characterisation of M-estimators trained on a family of elliptical covariate and noise data distributions, including cases where second and higher moments do not exist. We show that, despite being consistent, the Huber loss with optimally tuned location parameter $\delta$ is suboptimal in the high-dimensional regime in the presence of heavy-tailed noise, necessitating regularisation for optimal performance. This result also uncovers the existence of a curious transition in $\delta$ as a function of the sample complexity and contamination. Moreover, we derive the decay rates for the excess risk of ridge regression. We show that, while it is optimal and universal for noise distributions with finite second moment, its decay rate can be considerably faster when the covariates' second moment does not exist. Finally, we show that our formulas readily generalise to a richer family of models and data distributions, such as generalised linear estimation with arbitrary convex regularisation trained on mixture models.
Urte Adomaityte · Leonardo Defilippis · Bruno Loureiro · Gabriele Sicuro
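For reference, the Huber loss with parameter $\delta$ (called the location parameter in the abstract) is the standard

$$
\rho_\delta(z) \;=\;
\begin{cases}
\tfrac{1}{2} z^2, & |z| \le \delta,\\[2pt]
\delta |z| - \tfrac{1}{2}\delta^2, & |z| > \delta,
\end{cases}
$$

which is quadratic near the origin and grows only linearly in the tails, making it the natural M-estimation loss to study under heavy-tailed noise.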
Large Deviations and Metastability Analysis for Heavy-Tailed Dynamical Systems (Poster)
We study large deviations and metastability of heavy-tailed stochastic dynamical systems and provide the heavy-tailed counterparts of the classical Freidlin-Wentzell and Eyring-Kramers theory. Our findings address rare-event analysis for sufficiently general events and heavy-tailed dynamical systems. We also unveil intricate phase transitions in the first exit problems under truncated heavy-tailed noises. Furthermore, our results provide tools to systematically study the connection between the global dynamics of stochastic gradient descent (SGD) under heavy-tailed noises and the generalization mystery of deep learning.
Chang-Han Rhee · Xingyu Wang
Associative Memories with Heavy-Tailed Data (Poster)
Learning arguably involves the discovery and memorization of abstract rules. But how do associative memories appear in transformer architectures optimized with gradient descent algorithms? We derive precise scaling laws for a simple input-output associative memory model with respect to parameter size, and discuss the statistical efficiency of different estimators, including optimization-based algorithms. We provide extensive numerical experiments to validate and interpret the theoretical results, including fine-grained visualizations of the stored memory associations.
Vivien Cabannes · Elvis Dohmatob · Alberto Bietti
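As a rough illustration of an input-output associative memory, the sketch below stores token associations as a sum of outer products of fixed embeddings and retrieves by nearest output embedding. The construction and dimensions are assumptions for illustration, not the authors' exact model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 50, 256                                       # vocabulary size, embedding dim
E_in = rng.standard_normal((n_tokens, d)) / np.sqrt(d)      # fixed input embeddings
E_out = rng.standard_normal((n_tokens, d)) / np.sqrt(d)     # fixed output embeddings

# Store input->output pairs as a sum of outer products (one generic way to
# build an associative memory).
pairs = [(i, (i * 7) % n_tokens) for i in range(n_tokens)]
W = sum(np.outer(E_out[y], E_in[x]) for x, y in pairs)

def recall(x):
    """Retrieve: project the input embedding through W and decode by nearest output embedding."""
    return int(np.argmax(E_out @ (W @ E_in[x])))

print(sum(recall(x) == y for x, y in pairs) / len(pairs))   # recall accuracy
```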
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization (Poster)
We identify a new phenomenon in network optimization which arises from the interaction of depth and a particular heavy-tailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics. In particular, it implies a conceptually new cause for progressive sharpening and the edge of stability; we also highlight connections to other concepts in optimization and generalization including grokking, simplicity bias, and Sharpness-Aware Minimization. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong opposing signals: consistent, large-magnitude features which dominate the network output throughout training and provide gradients which point in opposite directions. We describe how to identify these groups, explore what sets them apart, and carefully study their effect on the network's optimization and behavior. We complement these experiments with a mechanistic explanation on a toy example of opposing signals and a theoretical analysis of a two-layer linear network on a simple model. Our finding enables new qualitative predictions of training behavior which we confirm experimentally. It also provides a new lens through which to study and improve modern training practices for stochastic optimization, which we highlight via a case study of Adam versus SGD.
Elan Rosenfeld · Andrej Risteski
Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility (Poster)
This work studies the infinite-width limit of deep feedforward neural networks whose weights are dependent and modelled via a mixture of Gaussian distributions. Under this model, we show that each layer of the infinite-width neural network can be characterised by two simple quantities: a non-negative scalar parameter and a Lévy measure on the positive reals. If the scalar parameters are strictly positive and the Lévy measures are trivial at all hidden layers, then one recovers the classical Gaussian process (GP) limit, obtained with iid Gaussian weights. More interestingly, if the Lévy measure of at least one layer is non-trivial, we obtain a mixture of Gaussian processes (MoGP) in the large-width limit. The behaviour of the neural network in this regime is very different from the GP regime. One obtains correlated outputs, with non-Gaussian distributions, possibly with heavy tails. We illustrate some of the benefits of the MoGP regime over the GP regime in terms of representation learning and compressibility on simulated, MNIST and Fashion-MNIST datasets.
Hoil Lee · Fadhel Ayed · Paul Jung · Juho Lee · Hongseok Yang · Francois Caron
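A toy illustration of how dependent, scale-mixed Gaussian weights produce heavy-tailed marginals: the Pareto-distributed per-row variances below are an illustrative choice, not the paper's Lévy-measure-based construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_mixture_weights(n_in, n_out, alpha=1.5):
    """Weights drawn as a Gaussian scale mixture with per-row random variances.

    Heavy-tailed variances make entries within a row dependent and give the
    marginal weight distribution heavier tails than iid Gaussian weights.
    """
    scales = rng.pareto(alpha, size=(n_out, 1)) + 1.0      # heavy-tailed variances
    return np.sqrt(scales / n_in) * rng.standard_normal((n_out, n_in))

W = scale_mixture_weights(1000, 1000)
print(np.mean(np.abs(W) > 5 * np.abs(W).std()))            # far more 5-sigma entries than a Gaussian
```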
Adaptive Gradient Methods at the Edge of Stability (Poster)
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large-batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value: the stability threshold of a gradient descent algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
Jeremy M Cohen · Behrooz Ghorbani · Shankar Krishnan · Naman Agarwal · Sourabh Medapati · Michal Badura · Daniel Suo · Zachary Nado · George Dahl · Justin Gilmer
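The tracked quantity is the largest eigenvalue of the preconditioned Hessian, compared against $38/\eta$; the sketch below computes it for a toy diagonal example (bias correction and the paper's exact preconditioner conventions are simplified away, and the toy numbers are illustrative).

```python
import numpy as np

def preconditioned_sharpness(H, v, eps=1e-8):
    """Largest eigenvalue of the Adam-preconditioned Hessian P^{-1/2} H P^{-1/2},
    where P = diag(sqrt(v) + eps) is Adam's diagonal preconditioner."""
    P_inv_sqrt = np.diag(1.0 / np.sqrt(np.sqrt(v) + eps))
    return np.linalg.eigvalsh(P_inv_sqrt @ H @ P_inv_sqrt).max()

eta = 1e-3
H = np.diag([40000.0, 1.0])     # toy Hessian of a quadratic loss
v = np.array([1.0, 1.0])        # toy second-moment estimates
print(preconditioned_sharpness(H, v), "vs threshold 38/eta =", 38 / eta)
```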
Online Student-$t$ Processes with an Overall-local Scale Structure for Modelling Non-stationary Data (Poster)
Time-dependent data often exhibit characteristics, such as non-stationarity and heavy-tailed errors, that would be inappropriate to model with the typical assumptions used in popular models. Thus, more flexible approaches are required to accommodate such issues. To this end, we propose a Bayesian mixture of Student-$t$ processes with an overall-local scale structure for the covariance. Moreover, we use a sequential Monte Carlo (SMC) sampler to perform online inference as data arrive in real time. We demonstrate the superiority of our proposed approach over typical Gaussian-process-based models on real-world data sets, underscoring the necessity of using mixtures of Student-$t$ processes.
Taole Sha · Michael Minyi Zhang
On the Varied Faces of Overparameterization in Supervised and Self-Supervised Learning (Poster)
The quality of the representations learned by neural networks depends on several factors, including the loss function, learning algorithm, and model architecture. In this work, we use information-geometric measures to assess representation quality in a principled manner. We demonstrate that the sensitivity of learned representations to input perturbations, measured by the spectral norm of the feature Jacobian, provides valuable information about downstream generalization. On the other hand, measuring the coefficient of spectral decay observed in the eigenspectrum of the feature covariance provides insights into the global representation geometry. First, we empirically establish an equivalence between these notions of representation quality and show that they are inversely correlated. Second, our analysis reveals the varying roles that overparameterization plays in improving generalization. Unlike supervised learning, we observe that increasing model width leads to higher discriminability and less smoothness in the self-supervised regime. Furthermore, we report that there is no observable double descent phenomenon in SSL with non-contrastive objectives for commonly used parameterization regimes, which opens up new opportunities for tight asymptotic analysis. Taken together, our results provide a loss-aware characterization of the different roles of overparameterization in supervised and self-supervised learning.
Matteo Gamba · Arna Ghosh · Kumar Krishna Agrawal · Blake Richards · Hossein Azizpour · Mårten Björkman
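The two representation-quality measures mentioned, the spectral norm of the feature Jacobian and the spectral decay of the feature covariance, can be estimated as in the sketch below; the exact-Jacobian computation and the log-log least-squares fit are illustrative choices that only scale to toy encoders.

```python
import torch

def jacobian_spectral_norm(encoder, x):
    """Spectral norm of the feature Jacobian df/dx at a single input x
    (exact Jacobian; fine for a toy encoder, too expensive for real networks)."""
    J = torch.autograd.functional.jacobian(encoder, x)
    return torch.linalg.matrix_norm(J.reshape(-1, x.numel()), ord=2)

def spectral_decay_coefficient(features):
    """Slope alpha of a log-log fit lambda_k ~ k^{-alpha} to the eigenspectrum
    of the feature covariance (one common way to quantify spectral decay)."""
    cov = torch.cov(features.T)
    eig = torch.linalg.eigvalsh(cov).flip(0).clamp_min(1e-12)
    k = torch.arange(1, len(eig) + 1, dtype=eig.dtype)
    A = torch.stack([torch.log(k), torch.ones_like(k)], dim=1)
    slope = torch.linalg.lstsq(A, torch.log(eig).unsqueeze(1)).solution[0, 0]
    return -slope

encoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 32))
x = torch.randn(16)
feats = encoder(torch.randn(512, 16)).detach()
print(jacobian_spectral_norm(encoder, x), spectral_decay_coefficient(feats))
```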
Neural network compression with heavy-tailed SGD (Poster)
Neural network compression has become an increasingly important subject, due to its practical implications in terms of reducing computational requirements and its theoretical implications, as there is an explicit connection between compressibility and generalization error. Recent studies have shown that the choice of the hyperparameters of stochastic gradient descent (SGD) can have an effect on the compressibility of the learned parameter vector. Even though these results have shed some light on the role of the training dynamics in compressibility, they relied on unverifiable assumptions, and the resulting theory does not provide a practical guideline due to its implicitness. In this study, we propose a simple modification of SGD such that the outputs of the algorithm are provably compressible without making any nontrivial assumptions. We consider a one-hidden-layer neural network trained with SGD and inject additive heavy-tailed noise into the iterates at each iteration. We then show that, for any compression rate, there exists a level of overparametrization (i.e., a number of hidden units) such that the output of the algorithm will be compressible with high probability. We illustrate our approach with experiments, where the results suggest that the proposed approach achieves compressibility with only a slight compromise in training and test error.
Yijun Wan · Abdellatif Zaidi · Umut Simsekli
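A minimal sketch of the proposed modification, injecting symmetric alpha-stable noise into each SGD iterate; the noise scale, step size, and toy quadratic objective are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sym_alpha_stable(alpha, size):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method."""
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * U) / np.cos(U) ** (1 / alpha)
            * (np.cos(U - alpha * U) / W) ** ((1 - alpha) / alpha))

def sgd_step_with_heavy_noise(w, grad, lr=0.01, alpha=1.8, sigma=1e-3):
    """Plain SGD step plus an additive heavy-tailed (alpha-stable) perturbation."""
    return w - lr * grad(w) + sigma * sym_alpha_stable(alpha, w.shape)

w = np.zeros(10)
grad = lambda w: w - 1.0            # gradient of 0.5 * ||w - 1||^2
for _ in range(500):
    w = sgd_step_with_heavy_noise(w, grad)
print(w.round(2))
```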
Representation Learning for Extremes (Poster)
Extreme events are potentially catastrophic events that occur infrequently within an observation time frame, and it is necessary to understand the distribution of these events to properly plan for them. Extreme value theory provides a theoretical framework for extrapolating to the tails of a distribution using limited observations. However, for high-dimensional data such as images, the covariates are generally not extreme; rather, the features may be. In this work, we propose a framework for learning representations according to properties of extreme value theory. Specifically, we use the max-stability property of extreme value distributions to inform the representations of the model such that they extrapolate to the rare data observations. We theoretically characterize the properties of the model and provide an identifiability result for the parameters of the latent distribution. Our preliminary results suggest the promise of the method for extrapolating to regions of the distribution with little density.
Ali Hasan · Yuting Ng · Jose Blanchet · Vahid Tarokh
High Probability Guarantees for Random Reshuffling (Poster)
We study the probabilistic behavior of stochastic gradient descent with random reshuffling (RR) on nonconvex problems. We prove that the same complexity (up to a logarithmic term) as in the in-expectation case also holds with high probability, which characterizes the performance of RR for a single run rather than an average over infinitely many realizations. Our analysis does not impose any additional assumptions on the stochastic gradient errors, thus admitting heavy tails. This is in contrast to high-probability analyses of SGD that rely on sub-Gaussian stochastic gradient errors or on tricks such as clipping or momentum. Furthermore, leveraging the established high-probability error bounds, we propose a simple stopping criterion for RR that introduces little computational cost. We prove that the function value strictly decreases with high probability before the stopping criterion is triggered, ensuring that the criterion will indeed be activated. Finally, a "last iterate" result is established for the iterate returned by this stopping criterion. We believe that our new developments for RR serve as a stepping stone towards more refined high-probability analyses characterizing its performance.
Hengxu Yu · Xiao Li
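For concreteness, a minimal random-reshuffling loop: each epoch applies the component gradients in a fresh random permutation, which is the sampling scheme the high-probability analysis covers. The toy least-squares problem is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rr_sgd(w, grads, lr, n_epochs):
    """SGD with random reshuffling: each epoch visits every sample exactly once
    in a fresh random order (contrast with sampling indices i.i.d. with replacement)."""
    n = len(grads)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            w = w - lr * grads[i](w)
    return w

# Toy least-squares problem: f_i(w) = 0.5 * (a_i @ w - b_i)^2.
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
grads = [lambda w, a=a, y=y: a * (a @ w - y) for a, y in zip(A, b)]
print(rr_sgd(np.zeros(5), grads, lr=0.01, n_epochs=50))
```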
Towards Fully Adaptive Regret Minimization in Heavy-Tailed Bandits (Poster)
Heavy-tailed distributions naturally arise in many settings, from finance to telecommunications. While regret minimization under sub-Gaussian or bounded-support rewards has been widely studied, learning on heavy-tailed distributions only gained popularity over the last decade. In the stochastic heavy-tailed bandit problem, an agent learns under the assumption that the distributions have finite moments of maximum order $1+\epsilon$, uniformly bounded by a constant $u$, for some $\epsilon \in (0,1]$. To the best of our knowledge, the literature only provides algorithms requiring these two quantities as input. In this paper, we study the stochastic adaptive heavy-tailed bandit, a variation of the standard setting in which both $\epsilon$ and $u$ are unknown to the agent. We show that adaptivity comes at a cost, introducing two lower bounds on the regret of any adaptive algorithm that imply higher regret than in the standard setting. Finally, we introduce a specific distributional assumption and provide Adaptive Robust UCB, a regret minimization strategy matching the known lower bound for the heavy-tailed MAB problem.
Gianmarco Genalti · Lupo Marsigli · Nicola Gatti · Alberto Maria Metelli
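The abstract does not spell out Adaptive Robust UCB itself; for context, the non-adaptive Robust UCB baseline it builds on uses a truncated empirical mean plus a bonus that depends on the known moment parameters $(\epsilon, u)$, which is exactly the knowledge the paper aims to remove. The constants and thresholds below are illustrative, not the paper's algorithm.

```python
import numpy as np

def truncated_mean(x, u, eps, delta):
    """Truncated empirical mean for rewards with a bounded (1+eps)-th moment u:
    samples above a growing threshold are discarded (robust-UCB flavour,
    constants illustrative)."""
    s = np.arange(1, len(x) + 1)
    thresh = (u * s / np.log(1 / delta)) ** (1 / (1 + eps))
    return np.mean(np.where(np.abs(x) <= thresh, x, 0.0))

def robust_ucb_index(x, u, eps, t):
    """Optimistic index for one arm at round t: truncated mean plus a bonus
    shrinking like n^{-eps/(1+eps)}."""
    delta = t ** -2
    bonus = 4 * u ** (1 / (1 + eps)) * (np.log(1 / delta) / len(x)) ** (eps / (1 + eps))
    return truncated_mean(x, u, eps, delta) + bonus

rng = np.random.default_rng(0)
x = rng.standard_t(df=1.5, size=200)      # heavy-tailed rewards for one arm
print(robust_ucb_index(x, u=5.0, eps=0.4, t=200))
```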
Robust gradient estimation in the presence of heavy-tailed noise (Poster)
In applications such as training transformers on NLP tasks, or distributed learning in the presence of corrupted nodes, the stochastic gradients have a heavy-tailed distribution. We argue that in these settings, momentum is not the best-suited method for estimating the gradient. Instead, variants of momentum with different forms of clipping are better suited. Our argument is based on the following observation: in the presence of heavy-tailed noise, the sample median of the gradient is a better estimate than the sample mean. We then devise new iterative methods for computing the sample median on the fly based on the stochastic proximal point (SPP) method. These SPP methods, applied to different definitions of the median, give rise to known and new types of clipped momentum estimates. We find that these clipped momentum estimates are more robust at estimating the gradient in the presence of noise drawn from an alpha-stable distribution, and for a transformer architecture on the PTB and Wikitext-2 datasets, in particular when the batch size is large.
Fabian Schaipp · Umut Simsekli · Robert Gower
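A minimal sketch of a clipped-momentum gradient estimate of the kind the abstract describes: the running estimate moves towards each new stochastic gradient by a capped displacement, which can be read as a stochastic proximal point step on a median-type objective. The step size and clipping radius are illustrative, and the paper's exact update rules may differ.

```python
import numpy as np

def clipped_momentum_update(m, g, step=0.1, clip_radius=1.0):
    """Move the running estimate m towards the new stochastic gradient g,
    capping the displacement at clip_radius."""
    delta = g - m
    norm = np.linalg.norm(delta)
    if norm > clip_radius:
        delta = delta * (clip_radius / norm)
    return m + step * delta

# Heavy-tailed gradient noise around a true gradient of ones: the clipped
# estimate stays close to the truth while the plain running mean is dragged around.
rng = np.random.default_rng(0)
true_g = np.ones(10)
m, mean, n = np.zeros(10), np.zeros(10), 0
for _ in range(2000):
    g = true_g + rng.standard_cauchy(10)
    m = clipped_momentum_update(m, g)
    n += 1
    mean += (g - mean) / n
print(np.linalg.norm(m - true_g), np.linalg.norm(mean - true_g))
```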
Generalised Hyperbolic State-space Models for Inference in Dynamic Systems (Poster)
In this work we study linear vector stochastic differential equation (SDE) models driven by the generalised hyperbolic (GH) Lévy process for inference in continuous-time non-Gaussian filtering problems. The GH family of stochastic processes offers a flexible framework for modelling non-Gaussian, heavy-tailed characteristics and includes the normal inverse-Gaussian, variance-gamma and Student-t processes as special cases. We present continuous-time simulation methods for the solution of vector SDE models driven by GH processes and novel inference methodologies using a variant of sequential Markov chain Monte Carlo (MCMC). As an example, a particular formulation of Langevin dynamics is studied within this framework. The model is applied to both a synthetically generated data set and a real-world financial series to demonstrate its capabilities.
Yaman Kindap · Simon Godsill
Variational Inference for SDEs Driven by Fractional Noise (Poster)
We present a novel variational framework for performing inference in (neural) stochastic differential equations (SDEs) driven by Markov-approximate fractional Brownian motion (fBM). SDEs offer a versatile tool for modeling real-world continuous-time dynamic systems with inherent noise and randomness. Combining SDEs with the powerful inference capabilities of variational methods enables the learning of representative function distributions through stochastic gradient descent. However, conventional SDEs typically assume the underlying noise to follow a Brownian motion (BM), which hinders their ability to capture long-term dependencies. In contrast, fractional Brownian motion (fBM) extends BM to encompass non-Markovian dynamics, but existing methods for inferring fBM parameters are either computationally demanding or statistically inefficient. In this paper, building upon the Markov approximation of fBM, we derive the evidence lower bound essential for efficient variational inference of posterior path measures, drawing from the well-established field of stochastic analysis. Additionally, we provide a closed-form expression to determine optimal approximation coefficients. Furthermore, we propose the use of neural networks to learn the drift, diffusion and control terms within our variational posterior, leading to the variational training of neural SDEs. In this framework, we also optimize the Hurst index, governing the nature of our fractional noise. Beyond validation on synthetic data, we contribute a novel architecture for variational latent video prediction, an approach that, to the best of our knowledge, enables the first variational neural-SDE application to video perception.
Rembert Daems · Manfred Opper · Guillaume Crevecoeur · Tolga Birdal
Revisiting the Noise Model of SGD (Poster)
The effectiveness of stochastic gradient descent (SGD) is significantly influenced by stochastic gradient noise (SGN). Following the central limit theorem, SGN was initially described as Gaussian, but recently Simsekli et al. demonstrated that the SαS Lévy distribution better characterizes the stochastic gradient noise. Here, we revisit the noise model of SGD and provide robust, comprehensive empirical evidence that SGN is heavy-tailed and is better represented by the SαS distribution. Furthermore, we argue that different deep neural network (DNN) parameters preserve distinct SGN properties throughout training. We develop a novel framework based on a Lévy-driven stochastic differential equation (SDE), where one-dimensional Lévy processes describe each DNN parameter. This leads to a more accurate characterization of the dynamics of SGD around local minima.
Barak Battash · Ofir Lindenbaum
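For readers unfamiliar with how SGN is measured, the sketch below collects noise samples as the difference between minibatch and full-batch gradients on a toy model and applies a crude heavy-tail diagnostic (empirical kurtosis); the model, batch sizes, and diagnostic are illustrative stand-ins for the per-parameter distributional fits used in the paper.

```python
import torch

# Collect stochastic gradient noise (SGN) samples for one coordinate of a toy
# least-squares model: SGN = minibatch gradient - full-batch gradient.
torch.manual_seed(0)
X, y = torch.randn(2048, 20), torch.randn(2048)
w = torch.zeros(20, requires_grad=True)

def grad(idx):
    loss = ((X[idx] @ w - y[idx]) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

full_g = grad(torch.arange(len(X)))
noise = torch.stack([grad(torch.randint(0, len(X), (32,))) - full_g for _ in range(500)])
z = noise[:, 0]                                   # SGN samples for one coordinate
kurtosis = ((z - z.mean()) ** 4).mean() / z.var() ** 2
print(float(kurtosis))                            # values far above 3 suggest heavier-than-Gaussian tails
```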