NeurIPS Poster Understanding and Minimising Outlier Features in Transformer Training

Poster

Understanding and Minimising Outlier Features in Transformer Training

Bobby He · Lorenzo Noci · Daniele Paliotta · Imanol Schlag · Thomas Hofmann

East Exhibit Hall A-C #2402

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them.Our work focuses on the above questions, first identifying several quantitative metrics, such as the kurtosis over neuron activation norms, to measure OFs. With these metrics, we study how architectural and optimisation choices influence OFs, and provide practical insights to minimise OFs during training. As highlights, we introduce a novel unnormalised transformer block, the Outlier Protected block, and present a previously unknown benefit of non-diagonal preconditioning optimisers, finding both approaches to significantly reduce OFs and improve quantisation without compromising convergence speed, at scales of up to 7B parameters. Notably, our combination of OP block and non-diagonal preconditioner (SOAP) achieves 14.87 weight-and-activation int8 perplexity (from 14.71 in standard precision), compared to 63.4 int8 perplexity (from 16.00) with a default OF-prone combination of Pre-Norm model and Adam, when quantising OPT-125m models post-training.

Chat is not available.