
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
AJAY JAISWAL · Shiwei Liu · Tianlong Chen · Zhangyang "Atlas" Wang

Wed Dec 13 03:00 PM -- 05:00 PM (PST) @ Great Hall & Hall B1+B2 #2008
Large pre-trained transformers are the $\textit{show-stealers}$ of modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism in sparsifying them, due to the high computation and memory bottleneck of the repetitive $\textit{train-prune-retrain}$ routine of iterative magnitude pruning (IMP), which worsens with increasing model size. In this paper, we comprehensively study $\textit{induced sparse patterns}$ across multiple large pre-trained vision and language transformers. We propose the existence of $\textbf{essential sparsity}$, defined by a $\textbf{sharp dropping point}$ beyond which performance declines much faster w.r.t. the rise of the sparsity level when we directly remove the weights with the smallest magnitudes in $\textbf{one-shot}$. We also present an intriguing emerging phenomenon of $\textbf{abrupt sparsification}$ during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse in pre-training after certain iterations. Moreover, our observations indicate a $\textbf{counter-intuitive}$ finding: BERT trained with a larger amount of pre-training data tends to have a better ability to condense knowledge into comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). All our code will be publicly available.
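
The one-shot removal of smallest-magnitude weights described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function name `one_shot_magnitude_prune`, the flat-list weight representation, and the rounding of the pruned count are assumptions made only for illustration.

```python
def one_shot_magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Hypothetical helper for illustration: `weights` is a flat list of floats,
    `sparsity` is a fraction in [0, 1]. Pruning happens in a single pass,
    with no retraining between steps (unlike iterative magnitude pruning).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")

    # Number of weights to remove (rounded down; an assumption of this sketch).
    k = int(sparsity * len(weights))

    # Indices sorted by absolute value, smallest magnitude first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:k])

    # Keep surviving weights unchanged and set pruned ones to zero.
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Usage: prune half of a toy weight vector in one shot.
toy_weights = [0.1, -2.0, 0.05, 3.0]
sparse_weights = one_shot_magnitude_prune(toy_weights, 0.5)
```

Sweeping `sparsity` over a range of values and evaluating the pruned model at each level is one way to look for the sharp dropping point the abstract refers to; the evaluation step depends on the model and task and is not shown here.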

Author Information

AJAY JAISWAL (The University of Texas, Austin)
Shiwei Liu (UT Austin)

I am a third-year Ph.D. student in the Data Mining Group, Department of Mathematics and Computer Science, Eindhoven University of Technology (TU/e). My current research topics are dynamic sparse training, sparse neural networks, pruning, the generalization of neural networks, etc. I am looking for a postdoc position in machine learning.

Tianlong Chen (MIT/Harvard/UNC Chapel Hill)
Zhangyang "Atlas" Wang (University of Texas at Austin)

More from the Same Authors