Workshop
Mathematics of Modern Machine Learning (M3L)
Zhiyuan Li · Tengyu Ma · Surbhi Goel · Kaifeng Lyu · Christina Baek · Bingbin Liu · Alex Damian · Aditi Raghunathan
Room 242
Sat 16 Dec, 6:50 a.m. PST
This workshop explores theory for understanding and advancing modern ML practices: optimization, generalization, and foundation models.
Schedule
Sat 6:50 a.m.  7:00 a.m.

Opening Remarks
(
Opening Remarks
)
>
SlidesLive Video 
🔗 
Sat 7:00 a.m.  7:45 a.m.

From algorithms to neural networks and back
(
Invited Talk
)
>
SlidesLive Video An increasingly common design and analysis paradigm for neural networks is thinking of them as parametrizing (implicitly or explicitly) some algorithm. In images, scorebased generative models can be thought of as parametrizing a learned sampler (a stochastic differential equation or a Markov Chain). In scientific applications, PDE solvers are trained as neural analogues of numerical solvers. In language, we probe to understand whether transformers can solve simple algorithmic tasks like parsing. In this talk, I’ll share several vignettes illustrating the value of an algorithmic lens in these settings: namely, understanding the performance of “natural” algorithms will allow us to understand the performance of neural methods, as well as explore and elucidate the architectural design space. 
Andrej Risteski 🔗 
Sat 7:45 a.m.  8:30 a.m.

How do twolayer neural networks learn complex functions from data over time?
(
Invited Talk
)
>
SlidesLive Video How do twolayer neural networks learn complex functions from data over time? In this talk, we shall delve into the interaction between batch size, number of iterations, and task complexity, shedding light on neural network adaptation to data features. I will particularly highlight three key findings: i) The significant impact of a single gradient step on the feature learning, emphasizing the relationship between batch size and the target's information exponent (or complexity). ii) The enhancement of the network's approximation ability over multiple gradient steps, enabling the learning of more intricate functions over time. iii) The improvement in generalization compared to the basic random feature/kernel regime. Our theoretical approach combines techniques from statistical physics, concentration of measure, projectionbased conditioning, and Gaussian equivalence, which we believe holds standalone significance. Based on joint work with Yatin Dandi, Bruno Loureiro, Luca Pesce, and Ludovic Stephan (https://arxiv.org/pdf/2305.18270.pdf) 
Florent Krzakala 🔗 
Sat 8:30 a.m.  8:40 a.m.

Feature Learning in InfiniteDepth Neural Networks
(
Oral
)
>
link
SlidesLive Video
By classifying infinitewidth neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinitewidththendepth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth$\mu$P that extends $\mu$P and show empirically it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as modern transformers), then we find fundamental limitations in all possible infinitedepth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as Megatron transformer trained on Common Crawl.

Greg Yang · Dingli Yu · Chen Zhu · Soufiane Hayou 🔗 
Sat 8:40 a.m.  8:50 a.m.

Fit Like You Sample: SampleEfficient Score Matching From Fast Mixing Diffusions
(
Oral
)
>
link
SlidesLive Video
Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. EnergyBased Models). The idea is to fit the score of the distribution (i.e. $\nabla_x \log p(x)$), rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical cost can be steep: recent work by (Koehler et al '22) showed that for distributions that have poor isoperimetric properties (a large Poincar'e or logSobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g. multimodal distributions as simple as a mixture of two Gaussians in one dimensionhave a poor Poincar'e constant.In this paper, we show a close connection between the mixing time of a broad class of Markov processes with generator $\mathcal{L}$ and stationary distribution $p$, and an appropriately chosen generalized score matching loss that tries to fit $\frac{\mathcal{O} p}{p}$. In the special case of $\mathcal{O} = \nabla_x$, and $\mathcal{L}$ being the generator of Langevin diffusion, this generalizes and recovers the results from (Koehler et al '22). This allows us to adapt techniques to speed up Markov chains to construct better scorematching losses. In particular, "preconditioning" the diffusion can be translated to an appropriate "preconditioning" of the score loss. Lifting the chain by adding a temperature like in simulated tempering can be shown to result in a Gaussianconvolution annealed score matching loss, similar to (SongErmon '19). Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in $d$ dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covarianceobviating the Poincar'e constantbased lower bounds of the basic score matching loss shown in (Koehler et al '22).

Yilong Qin · Andrej Risteski 🔗 
Sat 8:50 a.m.  9:00 a.m.

Deep Networks as Denoising Algorithms: SampleEfficient Learning of Diffusion Models in HighDimensional Graphical Models
(
Oral
)
>
link
SlidesLive Video We investigate the efficiency of deep neural networks for approximating scoring functions in diffusionbased generative modeling. While existing approximation theories leverage the smoothness of score functions, they suffer from the curse of dimensionality for intrinsically highdimensional data. This limitation is pronounced in graphical models such as Markov random fields, where the approximation efficiency of score functions remains unestablished. To address this, we note score functions can often be wellapproximated in graphical models through variational inference denoising algorithms. Furthermore, these algorithms can be efficiently represented by neural networks. We demonstrate this through examples, including Ising models, conditional Ising models, restricted Boltzmann machines, and sparse encoding models. Combined with offtheshelf discretization error bounds for diffusionbased sampling, we provide an efficient sample complexity bound for diffusionbased generative modeling when the score function is learned by deep neural networks. 
Song Mei · Yuchen Wu 🔗 
Sat 9:00 a.m.  9:10 a.m.

Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data
(
Oral
)
>
link
SlidesLive Video Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize nearoptimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfittingachieving a perfect fit to training data with nearrandom performance on test databefore transitioning (''grokking'') to nearoptimal generalization later in training. In this work, we show that both of these phenomena provably occur in twolayer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100\% training accuracy, perfectly fitting the noisy labels in the training data, but achieves nearrandom test accuracy. At a later training step, the network achieves nearoptimal test accuracy while still fitting the random labels in the training data, exhibiting a ''grokking'' phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a nongeneralizable linear classifier after one step and gradually learns generalizable features in later steps. 
Zhiwei Xu · Yutong Wang · Spencer Frei · Gal Vardi · Wei Hu 🔗 
Sat 9:10 a.m.  10:10 a.m.

Poster Session
(
Poster Session
)
>
Posters in this session: A PACBayesian Perspective on the Interpolating Information Criterion Graph Neural Networks Benefit from Structural Information Provably: A Feature Learning Perspective Linear attention is (maybe) all you need (to understand transformer optimization) Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study Feature Learning in InfiniteDepth Neural Networks Variational Classification Implicit biases in multitask and continual learningfrom a backward error analysis perspective Spectrum Extraction and Clipping for Implicitly Linear Layers The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization CurvatureDimension Tradeoff for Generalization in Hyperbolic Space Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations Unveiling the Hessian's Connection to the Decision Boundary Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks Large Learning Rates Improve Generalization: But How Large Are We Talking About? Understanding the Role of Noisy Statistics in the Regularization Effect of Batch Normalization Generalization Guarantees of Deep ResNets in the MeanField Regime Theoretical Explanation for Generalization from Adversarial Perturbations InContext Convergence of Transformers How TwoLayer Neural Networks Learn, One (Giant) Step at a Time Two Facets of SDE Under an InformationTheoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States Unraveling the Complexities of Simplicity Bias: Mitigating and Amplifying Factors Transformers as Support Vector Machines Symmetric Meanfield Langevin Dynamics for Distributional Minimax Problems A Theoretical Study of Dataset Distillation Transformers Learn HigherOrder Optimization Methods for InContext Learning: A Study with Linear Models Introducing an Improved InformationTheoretic Measure of Predictive Uncertainty InContext Learning on Unstructured Data: Softmax Attention as a Mixture of Experts AttentionOnly Transformers and Implementing MLPs with Attention Heads Privacy at Interpolation: Precise Analysis for Random and NTK Features Denoising LowRank Data Under Distribution Shift: Double Descent and Data Augmentation A Theory of NonLinear Feature Learning with One Gradient Step in TwoLayer Neural Networks Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data How does Gradient Descent Learn Features  A Local Analysis for Regularized TwoLayer Neural Networks Understanding Transferable Representation Learning and Zeroshot Transfer in CLIP Provably Efficient CVaR RL in Lowrank MDPs Analysis of Task Transferability in Large Pretrained Classifiers On ScaleInvariant Sharpness Measures GibbsBased Information Criteria and the OverParameterized Regime Grokking modular arithmetic can be explained by margin maximization 
🔗 
Sat 10:10 a.m.  11:15 a.m.

Lunch Break
(
Lunch Break
)
>

🔗 
Sat 11:15 a.m.  12:00 p.m.

Benefits of learning with symmetries: eigenvectors, graph representations and sample complexity
(
Invited Talk
)
>
SlidesLive Video In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design. In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning or orthogonally equivariant learning. We will discuss targeted architectures that can universally express functions with the relevant invariances or equivariances  sign flips and changes of basis  and their theoretical and empirical benefits. Second, we will take a broader theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to a wide range of group actions. This talk is based on joint work with Joshua Robinson, Derek Lim, Behrooz Tahmasebi, Lingxiao Zhao, Tess Smidt, Suvrit Sra and Haggai Maron. 
Stefanie Jegelka 🔗 
Sat 12:00 p.m.  12:15 p.m.

Break
(
Break
)
>

🔗 
Sat 12:15 p.m.  1:00 p.m.

Adaptivity in Domain Adaptation and Friends
(
Invited Talk
)
>
SlidesLive Video Domain adaptation, transfer, multitask, meta, fewshots, or lifelong learning … these are all important recent directions in ML that all touch at the core of what we might mean by ‘AI’. As these directions all concern learning in heterogeneous and everchanging environments, they all share a central question: what information a 'source' distribution may have about a 'target' distribution, or put differently, which measures of discrepancy between distributions properly model such information. Our understanding of this central question is still rather fledgeling, with both positive and negative results. On one hand we show that traditional notions of distance and divergence between distributions (e.g., Wasserstein, TV, KL, Renyi) are in fact too conservative: a source may be 'far' from a target under such traditional notions, yet still admit much useful information about the target distribution. We then turn to the existence of 'adaptive' procedures, i.e., procedures which can optimally leverage such information in the source data without any prior distributional knowledge. Here the picture is quite nuanced: while various existing approaches turn out to be adaptive in usual settings with a single source and hypothesis class, no procedure can guarantee optimal rates adaptively in more general settings, e.g., settings with multiple source datasets (as in multitask learning), or settings with multiple hypothesis classes (as in model selection or hyperparameter tuning). Such negative results raise new questions, as they suggest that domain adaptation and related problems may benefit from more structure in practice than captured by current formalisms. The talk is based on joint work with collaborators over the last few years, namely, G. Martinet, S. Hanneke, J. Suk, Y. Mahdaviyeh, N. Galbraith. 
Samory Kpotufe 🔗 
Sat 1:00 p.m.  1:10 p.m.

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
(
Oral
)
>
link
SlidesLive Video
We study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization.We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR10 and ImageNet. Furthermore, using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a welldefined feature learning joint infinitewidth and infinitedepth limit and show convergence of finitesize network dynamics towards this limit.

Blake Bordelon · Lorenzo Noci · Mufan Li · Boris Hanin · Cengiz Pehlevan 🔗 
Sat 1:10 p.m.  1:20 p.m.

Understanding Transferable Representation Learning and Zeroshot Transfer in CLIP
(
Oral
)
>
link
SlidesLive Video Multimodal learning has become increasingly popular due to its ability to leverage information from different data sources. Recently, CLIP has emerged as an effective approach that employs visionlanguage contrastive pretraining to learn joint image and text representations and exhibits remarkable performance in zeroshot learning and textguided natural image generation. Despite the huge practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study transferrable representation learning underlying CLIP and demonstrate how features from different modalities get aligned. We also analyze its zeroshot transfer performance on the downstream tasks. Inspired by our analysis, we propose a new CLIPtype approach, which achieves better performance than CLIP and other stateoftheart methods on benchmark datasets. 
Zixiang Chen · Yihe Deng · Yuanzhi Li · Quanquan Gu 🔗 
Sat 1:20 p.m.  1:30 p.m.

InContext Convergence of Transformers
(
Oral
)
>
link
SlidesLive Video
Transformers have recently revolutionized many domains in modern machine learning and one salient discovery is their remarkable incontext learning capability, where models can solve an unseen task by utilizing taskspecific prompts without further parameters finetuning. This also inspired recent theoretical studies aiming to understand the incontext learning mechanism of transformers, which however focused only on $\textbf{linear}$ transformers. In this work, we take the first step toward studying the learning dynamics of a onelayer transformer with $\textbf{softmax}$ attention trained via gradient descent in order to incontext learn linear function classes. We consider a structured data model, where each token is randomly sampled from a set of feature vectors in either balanced or imbalanced fashion. For data with balanced features, we establish the finitetime convergence guarantee with nearzero prediction error by navigating our analysis over two phases of the training dynamics of the attention map. More notably, for data with imbalanced features, we show that the learning dynamics take a stagewise convergence process, where the transformer first converges to a nearzero prediction error for the query tokens of dominant features, and then converges later to a nearzero prediction error for the query tokens of underrepresented features, respectively via one and four training phases. Our proof features new techniques for analyzing the competing strengths of two types of attention weights, the change of which determines different phases.

Yu Huang · Yuan Cheng · Yingbin Liang 🔗 
Sat 1:30 p.m.  1:40 p.m.

Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study
(
Oral
)
>
link
SlidesLive Video Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory still remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum ``amplifying'' the selfstabilization (Damian et al., 2023). 
Prin Phunyaphibarn · Junghyun Lee · Bohan Wang · Huishuai Zhang · Chulhee Yun 🔗 
Sat 1:40 p.m.  1:50 p.m.

Linear attention is (maybe) all you need (to understand transformer optimization)
(
Oral
)
>
link
SlidesLive Video Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training transformers by carefully studying a simple yet canonical linearized shallow transformer model. Specifically, we train linear transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized transformer model could actually be a valuable, realistic abstraction for understanding transformer optimization. 
Kwangjun Ahn · Xiang Cheng · Minhak Song · Chulhee Yun · Ali Jadbabaie · Suvrit Sra 🔗 
Sat 1:50 p.m.  2:00 p.m.

Closing Remarks
(
Closing Remarks
)
>
SlidesLive Video 
🔗 
Sat 2:00 p.m.  3:00 p.m.

Poster Session
(
Poster Session
)
>
Posters in this session: Overparameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning On the Computational Complexity of Inverting Generative Models FlowBased HighDimensionally Distributional Robust Optimization Transformers as Decision Makers: Provable InContext Reinforcement Learning via Supervised Pretraining How Do Transformers Learn InContext Beyond Simple Functions? A Case Study on Learning with Representations A Theoretical Explanation of Deep RL Performance in Stochastic Environments Deep Networks as Denoising Algorithms: SampleEfficient Learning of Diffusion Models in HighDimensional Graphical Models UnderParameterized Double Descent for Ridge Regularized Least Squares Denoising of Data on a Line Continual Learning for LongTailed Recognition: Bridging the Gap in Theory and Practice SimVAE: Narrowing the gap between Discriminative & Generative Representation Learning Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate On Compositionality and Emergence in Physical Systems Generativie Modeling Escaping Random Teacher Initialization Enhances Signal Propagation and Representations The Expressive Power of Transformers with Chain of Thought Transformers as MultiTask Feature Selectors: Generalization Analysis of InContext Learning Fit Like You Sample: SampleEfficient Score Matching From Fast Mixing Diffusions Towards the Fundamental Limits of Knowledge Transfer over Finite Domains Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization MoXCo:How I learned to stop exploring and love my local minima? Firstorder ANIL provably learns representations despite overparametrisation A DataDriven Measure of Relative Uncertainty for Misclassification Detection NonVacuous Generalization Bounds for Large Language Models Learning from setbacks: the impact of adversarial initialization on generalization performance Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit Estimating optimal PACBayes bounds with Hamiltonian Monte Carlo Divergence at the Interpolation Threshold: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult Toward Studentoriented Teacher Network Training for Knowledge Distillation Adaptive SharpnessAware Pruning for Robust Sparse Networks Invariant LowDimensional Subspaces in Gradient Descent for Learning Deep Matrix Factorizations How Structured Data Guides Feature Learning: A Case Study of the Parity Problem The Next Symbol Prediction Problem: PAClearning and its relation to Language Mode Why Do We Need Weight Decay for Overparameterized Deep Networks? The DoubleEdged Sword: Perception and Uncertainty in Inverse Problems ls NearInterpolators: Fast Norm Growth and Tempered NearOverfitting On robust overfitting: adversarial training induced distribution matters Are Graph Neural Networks Optimal Approximation Algorithms? JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention 
🔗 


A PACBayesian Perspective on the Interpolating Information Criterion
(
Poster
)
>
link
Deep learning is renowned for its theorypractice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent contradiction with the wellknown biasvariance tradeoff. While such phenomena have proven challenging to theoretically study for general models, the recently proposed Interpolating Information Criterion (IIC) provides a valuable theoretical framework to examine performance for overparameterized models. Using the IIC, a PACBayes bound is obtained for a general class of models, characterizing factors which influence generalization performance in the interpolating regime. From the provided bound, we quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, optimizer, and parameterinitialization scheme; the spectrum of the empirical neural tangent kernel; curvature of the loss landscape; and noise present in the data. 
Liam Hodgkinson · Chris van der Heide · Robert Salomone · Fred Roosta · Michael Mahoney 🔗 


Graph Neural Networks Benefit from Structural Information Provably: A Feature Learning Perspective
(
Poster
)
>
link
Graph neural networks (GNNs) have shown remarkable capabilities in learning from graphstructured data, outperforming traditional multilayer perceptrons (MLPs) in numerous graph applications. Despite these advantages, there has been limited theoretical exploration into why GNNs are so effective, particularly from the perspective of feature learning. This study aims to address this gap by examining the role of graph convolution in feature learning theory under a specific data generative model. We undertake a comparative analysis of the optimization and generalization between twolayer graph convolutional networks (GCNs) and their convolutional neural network (CNN) counterparts. Our findings reveal that graph convolution significantly enhances the regime of low test error over CNNs. This highlights a substantial discrepancy between GNNs and MLPs in terms of generalization capacity, a conclusion further supported by our empirical simulations on both synthetic and realworld datasets. 
Wei Huang · Yuan Cao · Haonan Wang · Xin Cao · Taiji Suzuki 🔗 


Linear attention is (maybe) all you need (to understand transformer optimization)
(
Poster
)
>
link
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training transformers by carefully studying a simple yet canonical linearized shallow transformer model. Specifically, we train linear transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized transformer model could actually be a valuable, realistic abstraction for understanding transformer optimization. 
Kwangjun Ahn · Xiang Cheng · Minhak Song · Chulhee Yun · Ali Jadbabaie · Suvrit Sra 🔗 


Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study
(
Poster
)
>
link
Although gradient descent with momentum is widely used in modern deep learning, a concrete understanding of its effects on the training trajectory still remains elusive. In this work, we empirically show that momentum gradient descent with a large learning rate and learning rate warmup displays large catapults, driving the iterates towards flatter minima than those found by gradient descent. We then provide empirical evidence and theoretical intuition that the large catapult is caused by momentum ``amplifying'' the selfstabilization (Damian et al., 2023). 
Prin Phunyaphibarn · Junghyun Lee · Bohan Wang · Huishuai Zhang · Chulhee Yun 🔗 


Feature Learning in InfiniteDepth Neural Networks
(
Poster
)
>
link
By classifying infinitewidth neural networks and identifying the *optimal* limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for *widthwise hyperparameter transfer*, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for *depthwise parametrizations* of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinitewidththendepth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth$\mu$P that extends $\mu$P and show empirically it admits depthwise hyperparameter transfer. We identify *feature diversity* as a crucial factor in deep networks, and Depth$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (such as modern transformers), then we find fundamental limitations in all possible infinitedepth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as Megatron transformer trained on Common Crawl.

Greg Yang · Dingli Yu · Chen Zhu · Soufiane Hayou 🔗 


Variational Classification
(
Poster
)
>
link
We present variational classification (VC), a latent variable generalisation of neural network softmax classification under crossentropy loss. Our approach provides a novel probabilistic interpretation of the highly familiar softmax classification model, to which it relates comparably to variational vs deterministic autoencoders. We derive a training objective based on the evidence lower bound (ELBO) that is nontrivial to optimize, and an adversarial approach to maximise it. We reveal an inherent inconsistency within softmax classification that VC addresses, while also allowing flexible choices of distributions in the latent space in place of assumptions implicit in standard softmax classifiers. Empirical evaluation demonstrates that VC maintains accuracy while improving properties such as calibration and adversarial robustness, particularly under distribution shift and low data settings. This work brings new theoretical insight to modern machine learning practice. 
Shehzaad Dhuliawala · Mrinmaya Sachan · Carl Allen 🔗 


Implicit biases in multitask and continual learningfrom a backward error analysis perspective
(
Poster
)
>
link
Using backward error analysis, we compute implicit training biases in multitask and continual learning settings for neural networks trained with stochastic gradient descent. In particular, we derive modified losses that are implicitly minimized during training. They have three terms: the original loss, accounting for convergence, an implicit flatness regularization term proportional to the learning rate, and a last term, the conflict term, which can theoretically be detrimental to both convergence and implicit regularization. In multitask, the conflict term is a wellknown quantity, measuring the gradient alignment between the tasks, while in continual learning the conflict term is a new quantity in deep learning optimization, although a basic tool in differential geometry: The Lie bracket between the task gradients. 
Benoit Dherin 🔗 


Spectrum Extraction and Clipping for Implicitly Linear Layers
(
Poster
)
>
link
We show the effectiveness of automatic differentiation in efficiently and correctly computing and controlling the spectrum of implicitly linear operators, a rich family of layer types including all standard convolutional and dense layers. we provide the first clipping method which is correct for general convolution layers, and illuminate the representational limitation that caused correctness issues in prior work. by comparing the accuracy and performance of our methods to existing methods, using various experiments, show they lead to better generalization and adversarial robustness of the models. in addition to these advantages over the stateoftheart methods, we show they are much faster than the alternatives. 
Ali EbrahimpourBoroojeny · Matus Telgarsky · Hari Sundaram 🔗 


The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization
(
Poster
)
>
link
Empirical studies have demonstrated that the noise in stochastic gradient descent (SGD) aligns favorably with the local geometry of loss landscape. However, theoretical and quantitative explanations for this phenomenon remain sparse. In this paper, we offer a comprehensive theoretical investigation into the aforementioned {\em noise geometry} for overparameterized linear (OLMs) models and twolayer neural networks. We scrutinize both average and directional alignments, paying special attention to how factors like sample size and input data degeneracy affect the alignment strength. As a specific application, we leverage our noise geometry characterizations to study how SGD escapes from sharp minima, revealing that the escape direction has significant components along flat directions. This is in stark contrast to GD, which escapes only along the sharpest directions. To substantiate our theoretical findings, both synthetic and realworld experiments are provided. 
Mingze Wang · Lei Wu 🔗 


CurvatureDimension Tradeoff for Generalization in Hyperbolic Space
(
Poster
)
>
link
The inclusion of taskrelevant geometric embeddings in deep learning models is an important emerging direction of research, particularly when using hierarchical data. For instance, negatively curved geometries such as hyperbolic spaces are known to allow lowdistortion embedding of treelike hierarchical structures, which Euclidean spaces do not afford. Learning techniques for hyperbolic spaces, such as Hyperbolic Neural Networks (HNNs), have shown empirical accuracy improvement over classical Deep Neural Networks in tasks involving semantic or multiscale information, such as recommender systems or molecular generation. However, no research has investigated generalization properties specific to such geometries. In this work, we introduce generalization bounds for learning tasks in hyperbolic spaces, marking the first time such bounds have been proposed. We highlight a previously unnoticed and important difference with Euclidean embedding models, namely, under embeddings into spaces of negative curvature $\kappa<0$ and dimension $d$, only the product $\sqrt{\kappa}\ d$ influences generalization bounds. Hence, the curvature parameter of the space can be varied at fixed $d$ with the same effect on generalization as when varying $d$.

Nicolás Alvarado · Hans Lobel · Mircea Petrache 🔗 


Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations
(
Poster
)
>
link
Existing research often posits spurious features as "easier" to learn than core features in neural network optimization, but the nuanced impact of their relative simplicity remains underexplored. In this paper, we propose a theoretical framework and associated synthetic dataset grounded in boolean function analysis. Our framework allows for finegrained control on both the relative complexity (compared to core features) and correlation strength (with respect to the label) of spurious features. Experimentally, we observe that the presence of stronger spurious correlations or simpler spurious features leads to a slower rate of learning for the core features in networks when trained with (stochastic) gradient descent. Perhaps surprisingly, we also observe that spurious features are not forgotten even when the network has perfectly learned the core features. We give theoretical justifications for these observations for the special case of learning with parity features on a onelayer hidden network. Our findings justify the success of retraining the last layer for accelerating core feature convergence and identify limitations of debiasing algorithms that exploit early learning of spurious features. We corroborate our findings through experiments on realworld vision datasets, thereby validating the practical relevance of our framework. 
GuanWen Qiu · Da Kuang · Surbhi Goel 🔗 


Unveiling the Hessian's Connection to the Decision Boundary
(
Poster
)
>
link
Understanding the properties of wellgeneralizing minima is at the heart of deep learning research. On the one hand, the generalization of neural networks has been connected to the decision boundary complexity, which is hard to study in the highdimensional input space. Conversely, the flatness of a minimum has become a controversial proxy for generalization. In this work, we provide the missing link between the two approaches and show that the Hessian top eigenvectors characterize the decision boundary learned by the neural network. Notably, the number of outliers in the Hessian spectrum is proportional to the complexity of the decision boundary. Based on this finding, we provide a new and straightforward approach to studying the complexity of a highdimensional decision boundary. 
Mahalakshmi Sabanayagam · Freya Behrens · Urte Adomaityte · Anna Dawid 🔗 


Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks
(
Poster
)
>
link
Convolutional residual neural networks (ConvResNets), though overparameterized, can achieve remarkable prediction performance in practice, which cannot be well explained by conventional wisdom. To bridge this gap, we study the performance of ConvResNeXts, which cover ConvResNets as a special case, trained with weight decay from the perspective of nonparametric classification. Our analysis allows for infinitely many building blocks in ConvResNeXts, and shows that weight decay implicitly enforces sparsity on these blocks. Specifically, we consider a smooth target function supported on a lowdimensional manifold, then prove that ConvResNeXts can adapt to the function smoothness and lowdimensional structures and efficiently learn the function without suffering from the curse of dimensionality. Our findings partially justify the advantage of overparameterized ConvResNeXts over conventional machine learning models. 
Zixuan Zhang · Kaiqi Zhang · Minshuo Chen · Yuma Takeda · Mengdi Wang · Tuo Zhao · YuXiang Wang 🔗 


Large Learning Rates Improve Generalization: But How Large Are We Talking About?
(
Poster
)
>
link
Inspired by recent research that recommends starting neural networks training with large learning rates (LRs) to achieve the best generalization, we explore this hypothesis in detail. Our study clarifies the initial LR ranges that provide optimal results for subsequent finetuning or weight averaging. We find that these ranges are in fact significantly narrower than generally assumed. We conduct our main experiments in a simplified setup that allows precise control of the learning rate hyperparameter and validate our key findings in a more practical setting. 
Ekaterina Lobacheva · Eduard Pokonechny · Maxim Kodryan · Dmitry Vetrov 🔗 


Understanding the Role of Noisy Statistics in the Regularization Effect of Batch Normalization
(
Poster
)
>
link
Normalization layers have been shown to benefit the training stability and generalization of deep neural networks in various ways. For Batch Normalization (BN), the noisy statistics have been observed to have a regularization effect that depends on the batch size. Following this observation, Hoffer et. al. proposed Ghost Batch Normalization (GBN), where BN is explicitly performed independently on smaller subbatches, resulting in improved generalization in many settings. In this study we analyze and isolate the effect of the noisy statistics by comparing BN and GBN, introducing a noise injection method. We then quantitatively assess the effects of the noise, juxtaposing it with other regularizers like dropout and examining its potential role in the generalization disparities between batch normalization and its alternatives, including layer normalization and normalizationfree methods. 
Atli Kosson · Dongyang Fan · Martin Jaggi 🔗 


Generalization Guarantees of Deep ResNets in the MeanField Regime
(
Poster
)
>
link
Despite the widespread empirical success of ResNet, the generalization ability of deep ResNet is rarely explored beyond the lazytraining regime. In this work, we investigate ResNet in the limit of infinitely deep and wide neural networks, of which the gradient flow is described by a partial differential equation in the largeneural network limit, i.e., the \emph{meanfield} regime.To derive the generalization bounds under this setting, our analysis necessitates a shift from the conventional timeinvariant Gram matrix employed in the lazy training regime to a timevariant, distributiondependent version tailored to the meanfield regime.To this end, we provide a lower bound on the minimum eigenvalue of the Gram matrix under the meanfield regime.Besides, the traceability of the dynamic of KullbackLeibler (KL) divergence is also required under the meanfield regime.We therefore establish the linear convergence of the empirical error and estimate the upper bound of the KL divergence over parameters distribution.The above two results are employed to build the uniform convergence for generalization bound via Rademacher complexity.Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime and contribute to advancing the understanding of the fundamental properties of deep neural networks. 
Yihang Chen · Fanghui Liu · Yiping Lu · Grigorios Chrysos · Volkan Cevher 🔗 


Theoretical Explanation for Generalization from Adversarial Perturbations
(
Poster
)
>
link
It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies hypothesized that adversarial perturbations contain data features that are imperceptible to humans but still recognizable by neural networks. Empirical evidence has shown that neural networks trained on mislabeled samples with these perturbations can generalize to natural test data. However, a theoretical understanding of this counterintuitive phenomenon is limited. In this study, assuming orthogonal training samples, we first prove that onehiddenlayer neural networks can learn natural data structures from adversarial perturbations. Our results indicate that, under mild conditions, the decision boundary from learning perturbations aligns with that from natural data, except for specific points in the input space. 
Soichiro Kumano · Hiroshi Kera · Toshihiko Yamasaki 🔗 


InContext Convergence of Transformers
(
Poster
)
>
link
Transformers have recently revolutionized many domains in modern machine learning and one salient discovery is their remarkable incontext learning capability, where models can solve an unseen task by utilizing taskspecific prompts without further parameters finetuning. This also inspired recent theoretical studies aiming to understand the incontext learning mechanism of transformers, which however focused only on $\textbf{linear}$ transformers. In this work, we take the first step toward studying the learning dynamics of a onelayer transformer with $\textbf{softmax}$ attention trained via gradient descent in order to incontext learn linear function classes. We consider a structured data model, where each token is randomly sampled from a set of feature vectors in either balanced or imbalanced fashion. For data with balanced features, we establish the finitetime convergence guarantee with nearzero prediction error by navigating our analysis over two phases of the training dynamics of the attention map. More notably, for data with imbalanced features, we show that the learning dynamics take a stagewise convergence process, where the transformer first converges to a nearzero prediction error for the query tokens of dominant features, and then converges later to a nearzero prediction error for the query tokens of underrepresented features, respectively via one and four training phases. Our proof features new techniques for analyzing the competing strengths of two types of attention weights, the change of which determines different phases.

Yu Huang · Yuan Cheng · Yingbin Liang 🔗 


How TwoLayer Neural Networks Learn, One (Giant) Step at a Time
(
Poster
)
>
link
We investigate theoretically how the features of a $2$layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization.We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n =\mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n=\mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n=\mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batchsize of $n =\mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allow to drastically improve the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime, and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projectionbased conditioning, and Gaussian equivalence which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural nets adapt to features of the data.

Yatin Dandi · Florent Krzakala · Bruno Loureiro · Luca Pesce · Ludovic Stephan 🔗 


Two Facets of SDE Under an InformationTheoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
(
Poster
)
>
link
Stochastic differential equations (SDEs) have been shown recently to well characterize the dynamics of training machine learning models with SGD. This provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. Firstly, viewing SGD as fullbatch gradient descent with Gaussian gradient noise allows us to obtain trajectoriesbased generalization bound using the informationtheoretic bound. Secondly, assuming mild conditions, we estimate the steadystate weight distribution of SDE and use the informationtheoretic bound to establish terminalstatebased generalization bounds. 
Ziqiao Wang · Yongyi Mao 🔗 


Unraveling the Complexities of Simplicity Bias: Mitigating and Amplifying Factors
(
Poster
)
>
link
The success of neural networks depends on the generalization ability, while Shah et al. conclude that the inherent bias towards simplistic features, Simplicity Bias, hurts generalization by preferring simple but noisy features to complex yet predictive ones. We aim to understand the scenarios when simplicity bias occurs more severely and the factors that are helpful in mitigating its effects. We show that many traditional insights such as increasing training size and increasing informative feature dimensions are not as effective as balancing the modes of our data distribution, distorting the simplistic features, or even searching for a good initialization. Our empirical results reveal intriguing factors to simplicity bias, and we call for future investigations to a more thorough understanding of simplicity bias and its interplay among other related fields. 
Xuchen Gong · Tianwen Fu 🔗 


Transformers as Support Vector Machines
(
Poster
)
>
link
The transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through pairwise similarities computed as $\texttt{softmax}(XQK^\top X^\top)$, where $(K,Q)$ are the trainable keyquery parameters. In this work, we establish a formal equivalence between the optimization geometry of selfattention and a hardmargin SVM problem that separates optimal input tokens from nonoptimal tokens using linear constraints on the outerproducts of token pairs. This formalism allows us to characterize the implicit bias of 1layer transformers optimized with gradient descent: (1) Optimizing the attention layer, parameterized by $(K,Q)$, with vanishing regularization, converges in direction to an SVM solution minimizing the nuclear norm of the combined parameter $W:=KQ^\top$. Instead, directly parameterizing by $W$ minimizes a Frobenius norm SVM objective. (2) Complementing this, for $W$parameterization, we prove the local/global directional convergence of gradient descent under suitable geometric conditions, and propose a more general SVM equivalence that predicts the implicit bias of attention with nonlinear heads/MLPs.

Davoud Ataee Tarzanagh · Yingcong Li · Christos Thrampoulidis · Samet Oymak 🔗 


Symmetric Meanfield Langevin Dynamics for Distributional Minimax Problems
(
Poster
)
>
link
In this paper, we extend meanfield Langevin dynamics to minimax optimization over probability distributions for the first time with symmetric and provably convergent updates. We propose \emph{meanfield Langevin averaged gradient} (MFLAG), a singleloop algorithm that implements gradient descent ascent in the distribution spaces with a novel weighted averaging, and establish averageiterate convergence to the mixed Nash equilibrium. We also study both time and particle discretization regimes and prove a new uniformintime propagation of chaos result which accounts for the dependency of the particle interactions on all previous distributions. Furthermore, we propose \emph{meanfield Langevin anchored best response} (MFLABR), a symmetric doubleloop algorithm based on best response dynamics with linear lastiterate convergence. Finally, we study applications to zerosum Markov games and conduct simulations demonstrating longterm optimality. 
Juno Kim · Kakei Yamamoto · Kazusato Oko · Zhuoran Yang · Taiji Suzuki 🔗 


A Theoretical Study of Dataset Distillation
(
Poster
)
>
link
Modern machine learning models are often trained using massive amounts of data. Such large datasets come at a high cost in terms of both storage and computation, especially when the data will need to be used repeatedly (e.g., for neural architecture search or continual learning). Dataset distillation (DD) describes the process of constructing a smaller ``distilled'' dataset (usually consisting of synthetic examples), such that models trained on the distilled dataset will be similar to models trained on the original dataset. In this paper, we study DD from a theoretical perspective. We show that for generalized linear models, it is possible to construct a distilled dataset with only a single point which will exactly recover the model trained on the original dataset, regardless of the original number of points. We provide a specialized distillation for linear regression with size independent of the original number of points, but which perfectly reconstructs the model obtained from the original dataset with any dataindependent regularizer, or by combining the original dataset with any additional data. We also provide impossibility results showing that similar constructions are impossible for logistic regression, and that DD cannot be accomplished in general for kernel regression, even if the goal is only to recover a single model. 
Zachary Izzo · James Zou 🔗 


Transformers Learn HigherOrder Optimization Methods for InContext Learning: A Study with Linear Models
(
Poster
)
>
link
Transformers are remarkably good at *incontext learning* (ICL)learning from demonstrations without parameter updatesbut how they perform ICL remains a mystery. Recent work suggests that Transformers may learn incontext by internally running Gradient Descent, a firstorder optimization method. In this paper, we instead demonstrate that Transformers learn to implement higherorder optimization methods to perform ICL. Focusing on incontext linear regression, we show that Transformers learn to implement an algorithm very similar to *Iterative Newton's Method*, a higherorder optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method, with each middle layer roughly computing 3 iterations; Gradient Descent is a much poorer match for the Transformer. We also show that Transformers can learn incontext on illconditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.

Deqing Fu · Tianqi Chen · Robin Jia · Vatsal Sharan 🔗 


Introducing an Improved InformationTheoretic Measure of Predictive Uncertainty
(
Poster
)
>
link
Applying a machine learning model for decisionmaking in the real world requires to distinguish what the model knows from what it does not. A critical factor in assessing the knowledge of a model is to quantify its predictive uncertainty. Predictive uncertainty is commonly measured by the entropy of the Bayesian model average (BMA) predictive distribution. Yet, the properness of this current measure of predictive uncertainty was recently questioned. We provide new insights regarding those limitations. Our analyses show that the current measure erroneously assumes that the BMA predictive distribution is equivalent to the predictive distribution of the true model that generated the dataset. Consequently, we introduce a theoretically grounded measure to overcome these limitations. We experimentally verify the benefits of our introduced measure of predictive uncertainty. We find that our introduced measure behaves more reasonably in controlled synthetic tasks. Moreover, our evaluations on ImageNet demonstrate that our introduced measure is advantageous in realworld applications utilizing predictive uncertainty. 
Kajetan Schweighofer · Lukas Aichberger · Mykyta Ielanskyi · Sepp Hochreiter 🔗 


InContext Learning on Unstructured Data: Softmax Attention as a Mixture of Experts
(
Poster
)
>
link
Transformers have exhibited impressive incontext learning (ICL) capabilities: they can generate predictions for new query inputs based on sequences of inputs and outputs (i.e., prompts) without parameter updates. Despite numerous attempts to theoretically justify why these capabilities emerge, they often assume that inputoutput pairings in the data are known. This assumption can enable simplified transformers (e.g., ones using a single attention layer without the softmax activation) to achieve notable ICL performance. However, this assumption can often be unrealistic: the unstructured data that transformers are trained on rarely contains such inputoutput pairings. How does ICL emerge in unstructured data then? To answer this question, we propose a challenging task that does not assume prior knowledge of input and output pairings. Unlike previous studies, this new task elucidates the pivotal role of softmax attention in the robust ICL abilities of transformers, particularly ones with a single attention layer. We argue that the significance of the softmax activation stems from the provable equivalence of softmaxbased attention models with mixtures of experts, thereby facilitating the inference of inputoutput pairings from the data. Additionally, a probing analysis sheds light on when these pairings are learned within the model. While subsequent layers predictably encode more information about these pairings, we find that even the first attention layer contains a significant amount of pairing information. 
Kevin Christian Wibisono · Yixin Wang 🔗 


AttentionOnly Transformers and Implementing MLPs with Attention Heads
(
Poster
)
>
link
The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLPandattention transformer into an attentiononly transformer at the cost of greatly increasing the number of attention heads. 
Robert Huben · Valerie Morris 🔗 


Privacy at Interpolation: Precise Analysis for Random and NTK Features
(
Poster
)
>
link
Deep learning models are able tomemorize the training set. This makes them vulnerable to recovery attacks, raising privacy concerns to users, and many widespread algorithms such as empirical risk minimization (ERM) do not directly enforce safety guarantees. In this paper, we study the safety of ERM models when the training samples are interpolated (i.e., at interpolation) against a family of powerful blackbox information retrieval attacks. Our analysis quantifies this safety via two separate terms: (i) the model stability with respect to individual training samples, and (ii) the feature alignment between attacker query and original data. While the first term is well established in learning theory and it is connected to the generalization error in classical work, the second one is, to the best of our knowledge, novel.Our key technical result characterizes precisely the feature alignment for the two prototypical settings of random features (RF) and neural tangent kernel (NTK) regression. This proves that privacy strengthens with an increase in generalization capability, unveiling the role of the model and of its activation function. Numerical experiments show an agreement with our theory not only for RF/NTK models, but also for deep neural networks trained on standard datasets (MNIST, CIFAR10). 
Simone Bombari · Marco Mondelli 🔗 


Denoising LowRank Data Under Distribution Shift: Double Descent and Data Augmentation
(
Poster
)
>
link
Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to have access to noiseless training data from a different dataset than the test dataset. Motivated by this, we study supervised denoising and noisyinput regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to reallife data and modern machine learning. First, while most past theoretical work assumes that the data covariance matrix is fullrank and wellconditioned, empirical studies have shown that reallife data is approximately lowrank. Thus, we assume that our data matrices are lowrank. Second, we drop independence assumptions on our data. Third, the rise in computational power and dimensionality of data have made it important to study nonclassical regimes of learning. Thus, we work in the nonclassical proportional regime, where data dimension $d$ and number of samples $N$ grow as $d/N = c + o(1)$. For this setting, we derive general test error expressions for both denoising and noisyinput regression, and study when overfitting the noise is benign, tempered or catastrophic. We show that the test error exhibits double descent under general distribution shift, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using reallife data, where we match the theoretical predictions with under 1\% MSE error for lowrank data.

Chinmaya Kausik · Kashvi Srivastava · Rishi Sonthalia 🔗 


A Theory of NonLinear Feature Learning with One Gradient Step in TwoLayer Neural Networks
(
Poster
)
>
link
Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in twolayer fullyconnected neural networks under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning; characterized by the appearance of a separated rankone componentspikein the spectrum of the feature matrix.However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function and therefore learning nonlinear components is impossible.We show that with a learning rate that grows with the sample size,such training in fact introduces multiple rankone components, each corresponding to a specific polynomial feature.We further prove that the limiting largedimensional and large sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these nonlinear features can enhance learning. 
Behrad Moniri · Donghwan Lee · Hamed Hassani · Edgar Dobriban 🔗 


Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data
(
Poster
)
>
link
Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize nearoptimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfittingachieving a perfect fit to training data with nearrandom performance on test databefore transitioning (''grokking'') to nearoptimal generalization later in training. In this work, we show that both of these phenomena provably occur in twolayer ReLU networks trained by GD on XOR cluster data where a constant fraction of the training labels are flipped. In this setting, we show that after the first step of GD, the network achieves 100\% training accuracy, perfectly fitting the noisy labels in the training data, but achieves nearrandom test accuracy. At a later training step, the network achieves nearoptimal test accuracy while still fitting the random labels in the training data, exhibiting a ''grokking'' phenomenon. This provides the first theoretical result of benign overfitting in neural network classification when the data distribution is not linearly separable. Our proofs rely on analyzing the feature learning process under GD, which reveals that the network implements a nongeneralizable linear classifier after one step and gradually learns generalizable features in later steps. 
Zhiwei Xu · Yutong Wang · Spencer Frei · Gal Vardi · Wei Hu 🔗 


How does Gradient Descent Learn Features  A Local Analysis for Regularized TwoLayer Neural Networks
(
Poster
)
>
link
The ability of learning useful features is one of the major advantages of neural networks. Although recent works show that neural network can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradientbased training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture groundtruth directions. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training. 
Mo Zhou · Rong Ge 🔗 


Understanding Transferable Representation Learning and Zeroshot Transfer in CLIP
(
Poster
)
>
link
Multimodal learning has become increasingly popular due to its ability to leverage information from different data sources. Recently, CLIP has emerged as an effective approach that employs visionlanguage contrastive pretraining to learn joint image and text representations and exhibits remarkable performance in zeroshot learning and textguided natural image generation. Despite the huge practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study transferrable representation learning underlying CLIP and demonstrate how features from different modalities get aligned. We also analyze its zeroshot transfer performance on the downstream tasks. Inspired by our analysis, we propose a new CLIPtype approach, which achieves better performance than CLIP and other stateoftheart methods on benchmark datasets. 
Zixiang Chen · Yihe Deng · Yuanzhi Li · Quanquan Gu 🔗 


Provably Efficient CVaR RL in Lowrank MDPs
(
Poster
)
>
link
We study risksensitive Reinforcement Learning (RL), where we aim to maximizethe Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior theoretical work studying risksensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in lowrank MDPs with nonlinear function approximation. Lowrank MDPs assume the underlying transition kernel admits a lowrank decomposition, but unlike prior linear models, lowrank MDPs do not assume the feature or stateaction representation is known. We propose a novel Upper Confidence Bound (UCB) bonusdriven algorithm to carefully balance the interplay between exploration, exploitation, and representation learning in CVaR RL. We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ rate to yield an $\epsilon$optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of action space, and $d$ is the dimension of representations.Computationalwise, we design a novel discretized LeastSquares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the nearoptimal policy in a polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in lowrank MDPs.

Yulai Zhao · Wenhao Zhan · Xiaoyan Hu · Hofung Leung · Farzan Farnia · Wen Sun · Jason Lee 🔗 


Analysis of Task Transferability in Large Pretrained Classifiers
(
Poster
)
>
link
Transfer learning is a cornerstone of modern machine learning, enabling models to transfer the knowledge acquired from a source task to downstream target tasks with minimal finetuning. However, the relationship between the source task performance and the downstream target task performance (i.e., transferability) is poorly understood. In this work, we rigorously analyze the transferability of large pretrained models on downstream classification tasks after linear finetuning. We use a novel Task Transfer Analysis approach that transforms the distribution (and classifier) of the source task to produce a new distribution (and classifier) similar to that of the target task. Using this, we propose an upper bound on transferability composed of the Wasserstein distance between the transformed source and the target distributions, the conditional entropy between the label distributions of the two tasks, and the weighted loss of the source classifier on the source task. We propose an optimization problem that minimizes the proposed bound to estimate transferability. Using stateoftheart pretrained models, we show that the proposed upper bound accurately estimates transferability on various datasets and demonstrates the importance of high relatedness between the source and target tasks for achieving high transferability. 
Akshay Mehra · Yunbei Zhang · Jihun Hamm 🔗 


On ScaleInvariant Sharpness Measures
(
Poster
)
>
link
Recently, there has been a substantial surge of interest in the development of optimization algorithms tailored for overparameterized models. This interest centers around the objective of minimizing a concept of sharpness in conjunction with the original loss function, e.g., the SharpnessAware Minimization (SAM) algorithm shown effective in practice. Nevertheless, the majority of sharpness measures exhibit sensitivity to parameter scaling in neural networks, and they may even experience significant magnification when subjected to rescaling operations. Motivated by this issue, in this paper, we introduce a new class of scaleinvariant sharpness measures, that gives rise to a new class of scaleinvariant sharpnessaware objective functions. Furthermore, we prove that the newly introduced objective functions are explicitly biased towards the minimization of our scaleinvariant sharpness measures. 
Behrooz Tahmasebi · Ashkan Soleymani · Stefanie Jegelka · Patrick Jaillet 🔗 


GibbsBased Information Criteria and the OverParameterized Regime
(
Poster
)
>
link
Doubledescent refers to the unexpected drop in test loss of a learning algorithm beyond an interpolating threshold with overparameterization, which is not predicted by information criteria in their classical forms due to the limitations in the standard asymptotic approach. We update these analyses using the information risk minimization framework and provide Bayesian Information Criterion (BIC) for models trained by the Gibbs algorithm. Notably, the BIC penalty term for the Gibbs algorithm corresponds to a specific information measure, i.e., KL divergence. We extend this informationtheoretic analysis to overparameterized models by characterizing the Gibbsbased BIC for the random feature model in the regime where the number of parameters $p$ and the number of samples $n$ tend to infinity, with $p/n$ fixed. Our experiments demonstrate that the Gibbsbased BIC can select the highdimensional model and reveal the mismatch between marginal likelihood and population risk in the overparameterized regime, providing new insights for understanding the doubledescent phenomenon.

Haobo Chen · Yuheng Bu · Gregory Wornell 🔗 


Grokking modular arithmetic can be explained by margin maximization
(
Poster
)
>
link
We present a marginbased generalization theory explaining the “grokking” phenomenon (Power et, al. 2022), where the model generalizes long after overfitting to arithmetic datasets. Specifically, we study twolayer quadratic networks on mod$p$ arithmetic problems, and show that solutions with maximal margin normalized by $\ell_\infty$ norm generalize with $\tilde O(p^{5/3})$ samples. To the best of our knowledge, this is the first sample complexity bound strictly better than a trivial $O(p^2)$ complexity for modular addition. Empirically, we find that GD on unregularized $\ell2$ or cross entropy loss tend to maximize the margin. In contrast, we show that kernelbased models, such as networks that are wellapproximated by their neural tangent kernel, need $\Omega(p^2)$ samples to achieve nontrivial $\ell_2$ loss. Our theory suggests that grokking might be caused by overfitting in the kernel regime of early training, followed by generalization as gradient descent eventually leaves the kernel regime and maximizes the normalized margin.

Mohamad Amin Mohamadi · Zhiyuan Li · Lei Wu · Danica J. Sutherland 🔗 


Overparameterised Shallow Neural Networks with Asymmetrical Node Scaling: \\ Global Convergence Guarantees and Feature Learning
(
Poster
)
>
link
We consider gradientbased optimisation of wide, shallow neural networks with hiddennode ouputs scaled by positive scale parameters. The scale parameters are nonidentical, differing from classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. 
Fadhel Ayed · Francois Caron · Paul Jung · Juho Lee · Hoil Lee · Hongseok Yang 🔗 


On the Computational Complexity of Inverting Generative Models
(
Poster
)
>
link
The objective of generative model inversion is to identify a size$n$ latent vector that produces a generative model output that closely matches a given target. This operation is a core computational primitive in numerous modern applications involving computer vision and NLP. However, the problem is known to be computationally challenging and NPhard in the worst case. This paper aims to provide a finegrained view of the landscape of computational hardness for this problem. We establish several new hardness lower bounds for both exact and approximate model inversion. In exact inversion, the goal is to determine whether a target is contained within the range of a given generative model. Under the strong exponential time hypothesis (SETH), we demonstrate that the computational complexity of exact inversion is lower bounded by $\Omega(2^n)$ via a reduction from $k$SAT; this is a strengthening of known results. For the more practically relevant problem of approximate inversion, the goal is to determine whether a point in the model range is close to a given target with respect to the $\ell_p$norm. When $p$ is a positive odd integer, under SETH, we provide an $\Omega(2^n)$ complexity lower bound via a reduction from the closest vectors problem (CVP). Finally, when $p$ is even, under the exponential time hypothesis (ETH), we provide a lower bound of $2^{\Omega (n)}$ via a reduction from HalfClique and VertexCover.

Feyza Duman Keles · Chinmay Hegde 🔗 


FlowBased HighDimensionally Distributional Robust Optimization
(
Poster
)
>
link
Flowbased models establish a continuoustime invertible transport map between a data distribution and a prespecified target distribution, such as the standard Gaussian in normalizing flow. In this work, we study beyond the constraint of known target distributions. We specifically aim to find the worstcase distribution in distributional robust optimization (DRO), which is an infinitedimensional problem that becomes particularly challenging in highdimensional settings. To this end, we introduce a computational tool called FlowDRO Specifically, we reformulate the difficult task of identifying the worstcase distribution within a Wasserstein2 uncertainty set into a more manageable form, i.e., training parameters of a corresponding flowbased neural network. Notably, the proposed FlowDRO is applicable to general risk functions and data distributions in DRO. We demonstrate the effectiveness of the proposed approach in various highdimensional problems that can be viewed as DRO, including adversarial attack and differential privacy. 
Chen Xu · Jonghyeok Lee · Xiuyuan Cheng · Yao Xie 🔗 


Transformers as Decision Makers: Provable InContext Reinforcement Learning via Supervised Pretraining
(
Poster
)
>
link
Large transformer models pretrained on offline reinforcement learning datasets have demonstrated remarkable incontext reinforcement learning (ICRL) capabilities, where they can make good decisions when prompted with interaction trajectories from unseen environments. However, when and how transformers can be trained to perform ICRL have not been theoretically wellunderstood. In particular, it is unclear which reinforcementlearning algorithms transformers can perform in context, and how distribution mismatch in offline training data affects the learned algorithms. This paper provides a theoretical framework that analyzes supervised pretraining for ICRL. This includes two recently proposed training methods  algorithm distillation and decisionpretrained transformers. First, assuming model realizability, we prove the supervisedpretrained transformer will imitate the conditional expectation of the expert algorithm given the observed trajectory. The generalization error will scale with model capacity and a distribution divergence factor between the expert and offline algorithms. Second, we show transformers with ReLU attention can efficiently approximate nearoptimal online reinforcement learning algorithms like LinUCB and Thompson sampling for stochastic linear bandits, and UCBVI for tabular Markov decision processes. This provides the first quantitative analysis of the ICRL capabilities of transformers pretrained from offline trajectories. 
Licong Lin · Yu Bai · Song Mei 🔗 


How Do Transformers Learn InContext Beyond Simple Functions? A Case Study on Learning with Representations
(
Poster
)
>
link
While large language models based on the transformer architecture have demonstrated remarkable incontext learning (ICL) capabilities, understandings of such capabilities are still in an early stage, where existing theory and mechanistic understanding focus mostly on simple scenarios such as learning simple function classes. This paper takes initial steps on understanding ICL in more complex scenarios, by studying learning with \emph{representations}. Concretely, we construct synthetic incontext learning problems with a compositional structure, where the label depends on the input through a possibly complex but \emph{fixed} representation function, composed with a linear function that \emph{differs} in each instance. By construction, the optimal ICL algorithm first transforms the inputs by the representation function, and then performs linear ICL on top of the transformed dataset. We show theoretically the existence of transformers that approximately implement such algorithms with mild depth and size. Empirically, we find trained transformers consistently achieve nearoptimal ICL performance in this setting, and exhibit the desired dissection where lower layers transforms the dataset and upper layers perform linear ICL. Through extensive probing and a new pasting experiment, we further reveal several mechanisms within the trained transformers, such as concrete copying behaviors on both the inputs and the representations, linear ICL capability of the upper layers alone, and a postICL representation selection mechanism in a harder mixture setting. These observed mechanisms align well with our theory and may shed light on how transformers perform ICL in more realistic scenarios. 
Tianyu Guo · Wei Hu · Song Mei · Huan Wang · Caiming Xiong · Silvio Savarese · Yu Bai 🔗 


A Theoretical Explanation of Deep RL Performance in Stochastic Environments
(
Poster
)
>
link
Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a nearoptimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fittedQ iteration over those rollouts. We find that any regression algorithm that satisfies basic indistribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works with complex function approximators like neural networks, since it is empirically established that neural networks generalize well indistribution. Furthermore, SQIRL explains why random exploration works well in practice, since we show many environments can be solved by effectively estimating the random policy's Qfunction and then applying zero or a few steps of value iteration. We leverage SQIRL to derive instancedependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead—which is typically much smaller than the full horizon—and on the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. 
Cassidy Laidlaw · Banghua Zhu · Stuart J Russell · Anca Dragan 🔗 


Deep Networks as Denoising Algorithms: SampleEfficient Learning of Diffusion Models in HighDimensional Graphical Models
(
Poster
)
>
link
We investigate the efficiency of deep neural networks for approximating scoring functions in diffusionbased generative modeling. While existing approximation theories leverage the smoothness of score functions, they suffer from the curse of dimensionality for intrinsically highdimensional data. This limitation is pronounced in graphical models such as Markov random fields, where the approximation efficiency of score functions remains unestablished. To address this, we note score functions can often be wellapproximated in graphical models through variational inference denoising algorithms. Furthermore, these algorithms can be efficiently represented by neural networks. We demonstrate this through examples, including Ising models, conditional Ising models, restricted Boltzmann machines, and sparse encoding models. Combined with offtheshelf discretization error bounds for diffusionbased sampling, we provide an efficient sample complexity bound for diffusionbased generative modeling when the score function is learned by deep neural networks. 
Song Mei · Yuchen Wu 🔗 


UnderParameterized Double Descent for Ridge Regularized Least Squares Denoising of Data on a Line
(
Poster
)
>
link
In this paper, we present a simple example that provably exhibits double descent in the underparameterized regime. For simplicity, we look at the ridge regularized least squares denoising problem with data on a line embedded in highdimension space. By deriving an asymptotically accurate formula for the generalization error, we observe samplewise and parameterwise double descent with the peak in the underparameterized regime rather than at the interpolation point or in the overparameterized regime. Further, the peak of the samplewise double descent curve corresponds to a peak in the curve for the norm of the estimator, and adjusting $\mu$, the strength of the ridge regularization, shifts the location of the peak. We observe that parameterwise double descent occurs for this model for small $\mu$. For larger values of $\mu$, we observe that the curve for the norm of the estimator has a peak but that this no longer translates to a peak in the generalization error.

Rishi Sonthalia · Xinyue (Serena) Li · Bochao Gu 🔗 


Continual Learning for LongTailed Recognition: Bridging the Gap in Theory and Practice
(
Poster
)
>
link
The LongTailed Recognition (LTR) problem arises in imbalanced datasets. This paper bridges the theorypractice gap in this context, providing mathematical insights into the training dynamics of LTR scenarios by proposing a theorem stating that, under strong convexity, the learner's weights trained on the full dataset are bounded by those trained only on the Head. We extend this theorem for multiple subsets and introduce a novel perspective of using Continual Learning (CL) for LTR. We sequentially learn the Head and Tail by updating the learner's weights without forgetting the Head using CL methods. We prove that CL reduces loss compared to finetuning on the Tail. Our experiments on MNISTLT and standard LTR benchmarks (CIFAR100LT, CIFAR10LT, and ImageNetLT) validate our theory and demonstrate the effectiveness of CL solutions. We also show the efficacy of CL on realworld data, specifically the Caltech256 dataset, outperforming stateoftheart classifiers. Our work unifies LTR and CL and paves the way for leveraging advances in CL to tackle the LTR challenge effectively. 
Mahdiyar Molahasani · Ali Etemad · Michael Greenspan 🔗 


SimVAE: Narrowing the gap between Discriminative & Generative Representation Learning
(
Poster
)
>
link
Selfsupervised representation learning is a powerful paradigm that leverages the relationship between semantically similar data, such as augmentations, extracts of an image or sound clip, or multiple views/modalities. Recent methods, e.g. SimCLR, CLIP and DINO, have made significant strides, yielding representations that achieve stateoftheart results on multiple downstream tasks. A number of selfsupervised discriminative approaches have been proposed, e.g. instance discrimination, latent clustering and contrastive methods. Though often intuitive, a comprehensive theoretical understanding of their underlying mechanisms or what they learn eludes. Meanwhile, generative approaches, such as variational autoencoders (VAEs), fit a specific latent variable model and have principled appeal, but lag significantly in terms of performance. We present a theoretical analysis of selfsupervised discriminative methods and a graphical model that reflects the assumptions they implicitly make and unifies these methods. We show that fitting this model under an ELBO objective improves representations over previous VAE methods on several common benchmarks, narrowing the gap to discriminative methods, and can also preserve information lost by discriminative approaches. This work brings new theoretical insight to modern machine learning practice. 
Alice Bizeul · Carl Allen 🔗 


Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
(
Poster
)
>
link
Weight decay can significantly impact the optimization dynamics of deep neural networks. In certain situations the effects of weight decay and gradient updates on the magnitude of a parameter vector cancel out on average, forming a state known as equilibrium. This causes the expected rotation of the vector in each update to remain constant along with its magnitude. Importantly, equilibrium can arise independently for the weight vectors of different layers and neurons. These equilibria are highly homogeneous for some optimizer and normalization configurations, effectively balancing the average rotation—a proxy for the effective learning rate—across network components. In this work we explore the equilibrium states of multiple optimizers including AdamW and SGD with momentum, providing insights into interactions between the learning rate, weight decay, initialization, normalization and learning rate schedule. We show how rotational equilibrium can be enforced throughout training, eliminating the chaotic transient phase corresponding to the transition towards equilibrium, thus simplifying the training dynamics. Finally, we show that rotational behavior may play a key role in the effectiveness of AdamW compared to Adam with L2regularization, the performance of different normalization layers, and the need for learning rate warmup. 
Atli Kosson · Bettina Messmer · Martin Jaggi 🔗 


Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate
(
Poster
)
>
link
In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) with \emph{large learning rates}. Under such a training regime, our finding is that, the \emph{oscillation} of the NN weights caused by SGD with large learning rates turns out to be beneficial to the generalization of the NN, potentially improving over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon ``\emph{benign oscillation}". 
Miao Lu · Beining Wu · Xiaodong Yang · Difan Zou 🔗 


On Compositionality and Emergence in Physical Systems Generativie Modeling
(
Poster
)
>
link
The principle of compositionality plays a pivotal role in both machine learning and physical sciences but remains underexplored, particularly in the context of synthetic data derived from physical energy potentials. This study aims to bridge this gap by examining the compositional nature of synthetic datasets generated using composite energy potentials. By combining established LennardJones and Morse potentials into a composite potential, we generate synthetic datasets using Markov Chain Monte Carlo (MCMC) techniques. These datasets serve as training grounds for machine learning models, specifically Neural Ordinary Differential Equations (ODEs). Our primary focus is to investigate whether the properties of the composite datasets retain the characteristics of their individual components, effectively testing the principle of compositionality. The findings not only shed light on the compositional integrity of synthetic physical datasets but also lay the groundwork for more robust and interpretable machine learning models applied to complex physical systems by using the formalism of Category Theory. 
Justin Diamond 🔗 


Escaping Random Teacher Initialization Enhances Signal Propagation and Representations
(
Poster
)
>
link
Recent work shows that, by learning to mimic a random teacher network, student networks land in regions of the loss landscape that lead to better representations (as seen through linearprobing performance). In this paper, we investigate how this phenomenon translates into some concrete properties of the representations. To do so, we first provide a minimal setup that preserves the essence of this phenomenon. Then, we investigate key signal propagation and representation separability properties during random distillation. Our analysis reveals a twostage process: the network first undergoes a form of collapse in its representations, then it is steered to a landscape region that not only allows for better propagation of input signals but also gives rise to wellconditioned representations. 
Felix Sarnthein · Sidak Pal Singh · Antonio Orvieto · Thomas Hofmann 🔗 


The Expressive Power of Transformers with Chain of Thought
(
Poster
)
>
link
Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finitestate machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a "chain of thought" or "scratchpad", i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoderonly transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within contextsensitive languages, and polynomial steps make them recognize exactly the class of polynomialtime solvable problemsthe first exact characterization of a type of transformers in terms of standard complexity classes. Together, our results provide a nuanced framework for understanding how the length of a transformer’s chain of thought or scratchpad impacts its reasoning power. 
William Merrill · Ashish Sabharwal 🔗 


Transformers as MultiTask Feature Selectors: Generalization Analysis of InContext Learning
(
Poster
)
>
link
Transformerbased large language models have displayed impressive capabilities in the domain of incontext learning, wherein they use multiple inputoutput pairs to make predictions on unlabeled test data. To lay the theoretical groundwork for incontext learning, we delve into optimization and generalization of a singlehead, onelayer Transformer in the context of multitask learning for classification. Our investigation uncovers that lower sample complexity is associated with increased trainingrelevant features and reduced noise in prompts, resulting in improved learning performance. The trained model exhibits the mechanism to first attend to demonstrations of trainingrelevant features and then decode the corresponding label embedding. Furthermore, we delineate the conditions necessary for successful outofdomain generalization for incontext learning, specifically regarding the relationship between training and testing prompts. 
Hongkang Li · Meng Wang · Songtao Lu · Hui Wan · Xiaodong Cui · PinYu Chen 🔗 


Fit Like You Sample: SampleEfficient Score Matching From Fast Mixing Diffusions
(
Poster
)
>
link
Score matching is an approach to learning probability distributions parametrized up to a constant of proportionality (e.g. EnergyBased Models). The idea is to fit the score of the distribution (i.e. $\nabla_x \log p(x)$), rather than the likelihood, thus avoiding the need to evaluate the constant of proportionality. While there's a clear algorithmic benefit, the statistical cost can be steep: recent work by (Koehler et al '22) showed that for distributions that have poor isoperimetric properties (a large Poincar'e or logSobolev constant), score matching is substantially statistically less efficient than maximum likelihood. However, many natural realistic distributions, e.g. multimodal distributions as simple as a mixture of two Gaussians in one dimensionhave a poor Poincar'e constant.In this paper, we show a close connection between the mixing time of a broad class of Markov processes with generator $\mathcal{L}$ and stationary distribution $p$, and an appropriately chosen generalized score matching loss that tries to fit $\frac{\mathcal{O} p}{p}$. In the special case of $\mathcal{O} = \nabla_x$, and $\mathcal{L}$ being the generator of Langevin diffusion, this generalizes and recovers the results from (Koehler et al '22). This allows us to adapt techniques to speed up Markov chains to construct better scorematching losses. In particular, "preconditioning" the diffusion can be translated to an appropriate "preconditioning" of the score loss. Lifting the chain by adding a temperature like in simulated tempering can be shown to result in a Gaussianconvolution annealed score matching loss, similar to (SongErmon '19). Moreover, we show that if the distribution being learned is a finite mixture of Gaussians in $d$ dimensions with a shared covariance, the sample complexity of annealed score matching is polynomial in the ambient dimension, the diameter of the means, and the smallest and largest eigenvalues of the covarianceobviating the Poincar'e constantbased lower bounds of the basic score matching loss shown in (Koehler et al '22).

Yilong Qin · Andrej Risteski 🔗 


Towards the Fundamental Limits of Knowledge Transfer over Finite Domains
(
Poster
)
>
link
We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal{S}$ over labels $\mathcal{A}$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{{\mathcal{S}\mathcal{A}}/{n}}$. The second level has the teacher probabilities of sampled labels available in addition, which turns out to boost the convergence rate lower bound to ${{\mathcal{S}\mathcal{A}}/{n}}$. However, under this second data acquisition protocol, minimizing a naive adaptation of the crossentropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared error logit loss. The third level further equips the student with the soft labels (complete logits) on $\mathcal{A}$ given every sampled input, thereby provably enables the student to enjoy a rate ${\mathcal{S}}/{n}$ free of $\mathcal{A}$. We find any KullbackLeibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.

Qingyue Zhao · Banghua Zhu 🔗 


Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
(
Poster
)
>
link
We identify a new phenomenon in neural network optimization which arises from the interaction of depth and a particular heavytailed structure in natural data. Our result offers intuitive explanations for several previously reported observations about network training dynamics, including a conceptually new cause for progressive sharpening and the edge of stability. It also provides a new lens through which to theoretically study and improve modern stochastic optimization on neural nets. Experimentally, we demonstrate the significant influence of paired groups of outliers in the training data with strong \emph{opposing signals}: consistent, large magnitude features which dominate the network output and occur in both groups with similar frequency. Due to these outliers, early optimization enters a narrow valley which carefully balances the opposing groups; subsequent sharpening causes their loss to rise rapidly, oscillating between high on one group and then the other, until the overall loss spikes. We complement these experiments with a theoretical analysis of a twolayer linear network on a simple model of opposing signals. Our finding enables new predictions of training behavior which we confirm experimentally. 
Elan Rosenfeld · Andrej Risteski 🔗 


MoXCo:How I learned to stop exploring and love my local minima?
(
Poster
)
>
link
Deep Neural Networks (DNNs) are wellknown for their generalization capabilities despite overparameterization. This is commonly attributed to the optimizer’s ability to find “good” solutions within highdimensional loss landscapes. However, widely employed adaptive optimizers, such as ADAM, may suffer from subpar generalization. This paper presents an innovative methodology, $\textit{MoXCo}$, to address these concerns by designing adaptive optimizers that not only expedite exploration with faster convergence speeds but also ensure the avoidance of overexploitation in specific parameter regimes, ultimately leading to convergence to good solutions.

Esha Singh · Shoham Sabach · YuXiang Wang 🔗 


Firstorder ANIL provably learns representations despite overparametrisation
(
Poster
)
>
link
Metalearning methods leverage data from previous tasks to learn a new task in a sampleefficient manner. In particular, modelagnostic methods look for initialisation points from which gradient descent quickly adapts to any new task.Although it has been empirically suggested that such methods learns shared representations during pretraining, there is limited theoretical evidence of such behavior. In this direction, this work shows, in the limit of infinite tasks, firstorder ANIL with a linear twolayer network successfully learns linear shared representations. This result even holds under overparametrisation; having a width larger than the dimension of the shared representations results in an asymptotically lowrank solution. 
Oguz Kaan Yuksel · Etienne Boursier · Nicolas Flammarion 🔗 


A DataDriven Measure of Relative Uncertainty for Misclassification Detection
(
Poster
)
>
link
Misclassification detection is an important problem in machine learning, as it allows for the identification of instances where the model's predictions are unreliable. However, conventional uncertainty measures such as Shannon entropy do not provide an effective way to infer the real uncertainty associated with the model's predictions. In this paper, we introduce a novel datadriven measure of relative uncertainty to an observer for misclassification detection. By learning patterns in the distribution of softpredictions, our uncertainty measure can identify misclassified samples based on the predicted class probabilities. Interestingly, according to the proposed measure, softpredictions that correspond to misclassified instances can carry a large amount of uncertainty, even though they may have low Shannon entropy. We demonstrate empirical improvements over multiple image classification tasks, outperforming stateoftheart misclassification detection methods. 
Eduardo Dadalto Câmara Gomes · Marco Romanelli · Georg Pichler · Pablo Piantanida 🔗 


NonVacuous Generalization Bounds for Large Language Models
(
Poster
)
>
link
Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply regurgitate their training corpora. We provide the first nonvacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded loglikelihood loss, and we extend the bound to handle subsampling, accelerating bound computation on massive datasets. To achieve the extreme level of compression required for nonvacuous generalization bounds, we devise SubLoRA, a lowdimensional nonlinear parameterization. Using this approach, we find that larger models have better generalization bounds and are more compressible than smaller models. 
Sanae Lotfi · Marc Finzi · Yilun Kuang · Tim G. J. Rudner · Micah Goldblum · Andrew Wilson 🔗 


Learning from setbacks: the impact of adversarial initialization on generalization performance
(
Poster
)
>
link
The loss landscape of stateoftheart neural networks is far from simple. Understanding how optimization algorithms initialized differently navigate such highdimensional nonconvex profiles is a key problem in machine learning. [Liu et al. 2020] use pretraining on random labels to produce adversarial initializations that lead stochastic gradient descent into global minima with poor generalization. This result contrasts with other literature arguing that pretraining on random labels produces positive effects (see, e.g., [Maennel et al. (2020)]). We ask under which conditions this initialization results in solutions that generalize poorly. Our goal is to build a theoretical understanding of the properties of good solutions by isolating this phenomenon in some minimal models. To this end, we posit and study several hypotheses for why the phenomenon might arise in models of varying levels of simplicity, including representation quality and complex structure in data. 
Yatin Dandi · Stefani Karp · Francesca Mignacco · Kavya Ravichandran 🔗 


Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
(
Poster
)
>
link
We study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization.We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR10 and ImageNet. Furthermore, using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a welldefined feature learning joint infinitewidth and infinitedepth limit and show convergence of finitesize network dynamics towards this limit.

Blake Bordelon · Lorenzo Noci · Mufan Li · Boris Hanin · Cengiz Pehlevan 🔗 


Estimating optimal PACBayes bounds with Hamiltonian Monte Carlo
(
Poster
)
>
link
An important yet underexplored question in the PACBayes literature is how much tightness we lose by restricting the posterior family to factorized Gaussian distributions when optimizing a PACBayes bound. We investigate this issue by estimating dataindependent PACBayes bounds using the optimal posteriors, comparing them to bounds obtained using MFVI. Concretely, we (1) sample from the optimal Gibbs posterior using Hamiltonian Monte Carlo, (2) estimate its KL divergence from the prior with thermodynamic integration, and (3) propose three methods to obtain highprobability bounds under different assumptions. Our experiments on the MNIST dataset reveal significant tightness gaps, as much as 56% in some cases. 
Szilvia Ujváry · Gergely Flamich · Vincent Fortuin · José Miguel HernándezLobato 🔗 


Divergence at the Interpolation Threshold: Identifying, Interpreting \& Ablating the Sources of a Deep Learning Puzzle
(
Poster
)
>
link
Machine learning models misbehave, often in unexpected ways. One prominent misbehavior is when the test loss diverges at the interpolation threshold, perhaps best known from its distinctive appearance in double descent. While considerable theoretical effort has gone into understanding generalization of overparameterized models, less effort has been made at understanding why the test loss misbehaves at the interpolation threshold. Moreover, analytically solvable models in this area employ a range of assumptions and use complex techniques from random matrix theory, statistical mechanics, and kernel methods, making it difficult to assess when and why test error might diverge. In this work, we analytically study the simplest supervised model  ordinary linear regression  and show intuitively and rigorously when and why a divergence occurs at the interpolation threshold using basic linear algebra. We identify three interpretable factors that, when all present, cause the divergence. We demonstrate on real data that linear models' test losses diverge at the interpolation threshold and that the divergence disappears when we ablate any one of the three identified factors. 
Rylan Schaeffer · Zachary Robertson · Akhilan Boopathy · Mikail Khona · Ila Fiete · Andrey Gromov · Sanmi Koyejo 🔗 


Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult
(
Poster
)
>
link
Large learning rates, when applied to gradient descent for nonconvex optimization, yield various implicit biases including edge of stability (Cohen et al., 2021), balancing (Wang et al., 2022), and catapult (Lewkowycz et al., 2020). These phenomena cannot be well explained by classical optimization theory. Significant theoretical progress has been made to understand these implicit biases, but it remains unclear for which objective functions would they occur. This paper provides an initial step in answering this question, showing that these implicit biases are different tips of the same iceberg. Specifically, they occur when the optimization objective function has certain regularity. This regularity, together with gradient descent using a large learning rate that favors flatter regions, result in these nontrivial dynamical behaviors. To demonstrate this claim, we develop new global convergence theory under large learning rates for two examples of nonconvex functions without global smoothness, departing from typical assumptions in traditional analyses. We also discuss the implications on training neural networks, where different losses and activations can affect regularity and lead to highly varied training dynamics. 
Yuqing Wang · Zhenghao Xu · Tuo Zhao · Molei Tao 🔗 


Toward Studentoriented Teacher Network Training for Knowledge Distillation
(
Poster
)
>
link
How to conduct teacher training for knowledge distillation is still an open problem. It has been widely observed that a bestperforming teacher does not necessarily yield the bestperforming student, suggesting a fundamental discrepancy between the current teacher training practice and the ideal teacher training strategy. To fill this gap, we explore the feasibility of training a teacher that is oriented toward student performance with empirical risk minimization (ERM). Our analyses are inspired by the recent findings that the effectiveness of knowledge distillation hinges on the teacher’s capability to approximate the true label distribution of training inputs. We theoretically establish that ERM minimizer can approximate the true label distribution of training data as long as the feature extractor of the learner network is Lipschitz continuous and is robust to feature transformations. In light of our theory, we propose a teacher training method SoTeacher which incorporates Lipschitz regularization and consistency regularization into ERM. Experiments on benchmark datasets using various knowledge distillation algorithms and teacherstudent pairs confirm that SoTeacher can improve student accuracy consistently. 
Chengyu Dong · Liyuan Liu · Jingbo Shang 🔗 


Adaptive SharpnessAware Pruning for Robust Sparse Networks
(
Poster
)
>
link
Robustness and compactness are two essential attributes of deep learning models that are deployed in the real world. The goals of robustness and compactness may seem to be at odds, since robustness requires generalization across domains, while the process of compression exploits specificity in one domain. We introduce \textit{Adaptive SharpnessAware Pruning (AdaSAP)}, which unifies these goals through the lens of network sharpness. The AdaSAP method produces sparse networks that are robust to input variations which are \textit{unseen at training time}. We achieve this by strategically incorporating weight perturbations in order to optimize the loss landscape. This allows the model to be both primed for pruning and regularized for improved robustness. AdaSAP improves the robust accuracy of pruned models on classification and detection over recent methods by up to +6\% on OOD datasets, over a wide range of compression ratios, pruning criteria, and architectures. 
Anna Bair · Hongxu Yin · Maying Shen · Pavlo Molchanov · Jose M. Alvarez 🔗 


Invariant LowDimensional Subspaces in Gradient Descent for Learning Deep Matrix Factorizations
(
Poster
)
>
link
An extensively studied phenomenon of the past few years in training deep networks is the implicit bias of gradient descent towards parsimonious solutions. In this work, we further investigate this phenomenon by narrowing our focus to deep matrix factorization, where we reveal surprising lowdimensional structures in the learning dynamics when the target matrix is lowrank. Specifically, we show that the evolution of gradient descent starting from arbitrary orthogonal initialization only affects a minimal portion of singular vector spaces across all weight matrices. In other words, the learning process happens only within a small invariant subspace of each weight matrix, despite the fact that all parameters are updated throughout training. From this, we provide rigorous justification for lowrank training in a specific, yet practical setting. In particular, we demonstrate that we can construct compressed factorizations that are equivalent to fullwidth, deep factorizations throughout training for solving lowrank matrix completion problems efficiently. 
Can Yaras · Peng Wang · Wei Hu · Zhihui Zhu · Laura Balzano · Qing Qu 🔗 


How Structured Data Guides Feature Learning: A Case Study of the Parity Problem
(
Poster
)
>
link
Recent works have shown that neural networks optimized by gradientbased methods can adapt to sparse or lowdimensional target functions through feature learning; an often studied target is classification of the sparse parity function on the unit hypercube. However, such isotropic data setting does not capture the anisotropy and low intrinsic dimensionality exhibited in realistic datasets. In this work, we address this shortcoming by studying how feature learning interacts with structured (anisotropic) input data: we consider the classification of sparse parity on highdimensional orthotope where the feature coordinates have varying magnitudes. Specifically, we analyze the learning complexity of the meanfield Langevin dynamics (MFLD), which describes the noisy gradient descent update on twolayer neural network, and show that the statistical complexity (i.e. sample size) and computational complexity (i.e. network width) of MFLD can both be improved when prominent directions of the anisotropic input data aligns with the support of the target function. Moreover, we demonstrate the benefit of feature learning by establishing a kernel lower bound on the classification error, which applies to neural networks in the lazy regime. 
Atsushi Nitanda · Kazusato Oko · Taiji Suzuki · Denny Wu 🔗 


The Next Symbol Prediction Problem: PAClearning and its relation to Language Models
(
Poster
)
>
link
The next symbol prediction (NSP) problem has been widely used to empirically evaluate the performance of neural sequence models on formal language tasks. We formalize the setting so as to make it amenable to PAClearning analysis. In the NSP setting, a learning algorithm receives valid sequences (positive examples) from the underlying language, along with rich labels indicating for every prefix, whether the prefix is in the language and what symbols could appear subsequently that lead to accepting string. In the conventional classification setting where learning occurs with only positive and negative examples, the problem of learning regular languages or even subclasses represented by acyclic DFAs is known to be computationally hard based on cryptographic assumptions. In contrast, our main result shows that regular languages are efficiently PAClearnable in the next symbol prediction setting. Further, we provide a more efficient learning algorithm for the case where the target DFA is known to be acyclic. Given the rich labels required in the NSP setting, one may wonder whether this setting is applicable to nonartificial tasks. We explain how language models can act as a source of such labeled data, and consequently, our algorithm can be applied to fit a finitestate model (DFA) that learns the (truncated) support of the language model. 
Satwik Bhattamishra · Phil Blunsom · Varun Kanade 🔗 


Why Do We Need Weight Decay for Overparameterized Deep Networks?
(
Poster
)
>
link
Weight decay is a broadly used technique for training stateoftheart deep networks. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics enhancing the everpresent implicit regularization of SGD via loss stabilization. 
Maksym Andriushchenko · Francesco D'Angelo · Aditya Vardhan Varre · Nicolas Flammarion 🔗 


The DoubleEdged Sword: Perception and Uncertainty in Inverse Problems
(
Poster
)
>
link
Inverse problems pose significant challenges due to their inherent ambiguity in mapping observed data back to its original state. While recent advances have yielded impressive results in restoring degraded data, attaining high perceptual quality comes at the cost of increased hallucinations. This paper investigates this phenomenon to reveal a fundamental tradeoff between perception and uncertainty in solving inverse problems. Using error entropy as a measure of uncertainty, we demonstrate that higher perceptual quality in restoration algorithms is accompanied by a surge in uncertainty. Leveraging Rényi divergence as a perception metric, we derive bounds for this tradeoff, allowing for categorization of different inverse methods based on their performance. Additionally, we connect estimation distortion with uncertainty, offering novel insights into the traditional perceptiondistortion tradeoff. Our work provides a rigorous framework for analyzing uncertainty in the context of solving inverse problems, highlighting its interplay with perception and distortion, while underscoring the limitations of current approaches to achieving both high perceptual quality and low uncertainty simultaneously. 
Regev Cohen · Ehud Rivlin · Daniel Freedman 🔗 


NearInterpolators: Fast Norm Growth and Tempered NearOverfitting
(
Poster
)
>
link
We study linear regression when the input data populationcovariance matrix has eigenvalues $\lambda_i \sim i^{\alpha}$ where $\alpha > 1$.Under a generic random matrix theory assumption, we provethat any nearinterpolator, i.e., ${\beta}$ whose training error is below the noise floor, must have its squared $\ell_2$norm growing superlinearly with the number of samples $n$:$\{\beta}\_{2}^{2} = \Omega(n^{\alpha})$. This implies that existing normbased generalization bounds increase as the number of samples increases, matching the empirical observations from prior work.On the other hand, such nearinterpolators when properly tuned achieve good generalization, where the test errors approach arbitrarily close to the noise floor.Our work demonstrates that existing normbased generalization bounds are vacuous for explainingthe generalization capability of \emph{any} nearinterpolators.Moreover, we show that the tradeoff between train and test accuracy is better when the norm growth exponential is smaller.

Yutong Wang · Rishi Sonthalia · Wei Hu 🔗 


On robust overfitting: adversarial training induced distribution matters
(
Poster
)
>
link
Robust overfitting has been observed to arise in adversarial training. We hypothesize that this phenomenon may be related to the evolution of the data distribution along the training trajectory. To investigate this, we select a set of checkpoints in adversarial training and perform standard training on distributions induced by adversarial perturbation w.r.t the checkpoints. We observe that the obtained models become increasingly harder to generalize when robust overfitting occurs, thereby validating the hypothesis. We show the hardness of generalization on the induced distributions is related to certain local property of the perturbation operator at each checkpoint. The connection between the local property and the generalization on the induced distribution is proved by establishing an upper bound of the generalization error. Other interesting phenomena related to the adversarial training trajectory are also observed. 
Runzhi Tian · Yongyi Mao 🔗 


Are Graph Neural Networks Optimal Approximation Algorithms?
(
Poster
)
>
link
In this work we design graph neural network architectures that capture optimal approximation algorithms for a large class of combinatorial optimization problems using powerful algorithmic tools from semidefinite programming (SDP). Concretely, we prove that polynomialsized message passing algorithms can represent the most powerful polynomial time algorithms for Max Constraint Satisfaction Problems assuming the Unique Games Conjecture. We leverage this result to construct efficient graph neural network architectures, OptGNN, that obtain highquality approximate solutions on landmark combinatorial optimization problems such as Max Cut and Minimum Vertex Cover. Finally, we take advantage of OptGNN's ability to capture convex relaxations to design an algorithm for producing dual certificates of optimality (bounds on the optimal solution) from the learned embeddings of OptGNN. 
Morris Yau · Eric Lu · Nikolaos Karalias · Jessica Xu · Stefanie Jegelka 🔗 


JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
(
Poster
)
>
link
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the selfattention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analysis (e.g., lack of residual connection), and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works. We leverage JoMA to qualitatively explains how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from realworld dataset (Wikitext2/Wikitext103) and various pretrained models (OPT, Pythia) verify our theoretical findings. 
Yuandong Tian · Yiping Wang · Zhenyu Zhang · Beidi Chen · Simon Du 🔗 