Training Dynamics Impact Quantization Degradation
Abstract
Despite its widespread use, little is understood about what makes large language models more (or less) robust to quantization. To address this question, we study the degradation induced by quantization in language modeling, analyzing open-source training trajectories of models with up to 3 billion parameters trained on up to 11 trillion tokens, and validating our analysis by pretraining 160M-parameter models on up to 100B tokens. Our findings reveal that, contrary to previous work, post-training quantization robustness is driven by a complex interplay between learning rate decay and validation loss. In particular, as the learning rate decays, validation loss and quantization error diverge, largely independently of the amount of training data. Finally, we present two examples of interventions on the training dynamics that modulate quantization error, sometimes favorably: (1) for comparable validation loss, higher learning rates can lead to smaller quantization error; and (2) weight averaging can favorably approximate learning rate decay in some settings.
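To make the central quantity concrete, the sketch below illustrates one common way to measure post-training quantization degradation: weights are fake-quantized with round-to-nearest, and the error is taken as the loss gap between the quantized and full-precision model on held-out data. This is a minimal illustration, not the paper's exact protocol; the bit-width, per-tensor scaling, and the toy model and data are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's protocol): round-to-nearest
# weight quantization and the resulting loss gap as "quantization error".
import copy
import torch
import torch.nn as nn


def quantize_weights_rtn(model: nn.Module, n_bits: int = 8) -> nn.Module:
    """Symmetric per-tensor round-to-nearest weight quantization (illustrative)."""
    qmodel = copy.deepcopy(model)
    qmax = 2 ** (n_bits - 1) - 1
    with torch.no_grad():
        for module in qmodel.modules():
            if isinstance(module, nn.Linear):
                w = module.weight
                scale = w.abs().max() / qmax                      # per-tensor scale
                w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
                module.weight.copy_(w_q)                          # fake-quantized weights
    return qmodel


def quantization_error(model, qmodel, batch, loss_fn):
    """Degradation = loss(quantized) - loss(full precision) on the same batch."""
    x, y = batch
    with torch.no_grad():
        return loss_fn(qmodel(x), y).item() - loss_fn(model(x), y).item()


# Toy usage: a small MLP on random data stands in for a language-model checkpoint.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
batch = (torch.randn(128, 32), torch.randint(0, 10, (128,)))
qmodel = quantize_weights_rtn(model, n_bits=4)
print(quantization_error(model, qmodel, batch, nn.CrossEntropyLoss()))
```

In this framing, tracking the gap across checkpoints of a training run is what allows quantization error and validation loss to be compared as the learning rate decays.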