NeurIPS SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Keynote Talk in Workshop: Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)

Song Han

Abstract:

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy due to outliers or do not run efficiently on hardware. I’ll present SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs, including OPT-175B, BLOOM-176B and GLM-130B, achieving faster inference speed with half the number of GPUs. We hope SmoothQuant can inspire economic deployment of LLMs in the future.
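
The abstract does not spell out how the outlier problem is handled, so the following is only a minimal sketch of one way per-channel smoothing can help W8A8 quantization, not the authors' implementation. It assumes a scaling rule of the form s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, symmetric per-tensor INT8 rounding, and synthetic data with one artificially inflated activation channel. All function names and sizes are hypothetical and chosen for illustration.

```python
import random

def fake_quant_int8(m):
    """Symmetric per-tensor INT8: snap values to the 8-bit grid, then dequantize."""
    max_abs = max(abs(v) for row in m for v in row)
    scale = max_abs / 127.0
    return [[round(v / scale) * scale for v in row] for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]

def mean_abs_diff(a, b):
    diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def smooth(x, w, alpha=0.5):
    """Divide activation channel j by s_j and multiply weight row j by s_j.

    The product x @ w is unchanged; only the value ranges are rebalanced.
    The formula for s_j is an assumption made for this sketch.
    """
    n_ch = len(w)
    s = []
    for j in range(n_ch):
        act_max = max(abs(row[j]) for row in x)
        w_max = max(abs(v) for v in w[j])
        s.append(act_max ** alpha / w_max ** (1 - alpha))
    x_s = [[row[j] / s[j] for j in range(n_ch)] for row in x]
    w_s = [[v * s[j] for v in w[j]] for j in range(n_ch)]
    return x_s, w_s

random.seed(0)
tokens, channels, outputs = 32, 8, 4  # hypothetical toy sizes

# Synthetic activations with one outlier channel (index 3 scaled by 50).
x = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(tokens)]
for row in x:
    row[3] *= 50.0
w = [[random.gauss(0, 1) for _ in range(outputs)] for _ in range(channels)]

y_ref = matmul(x, w)

# Naive W8A8: quantize activations and weights directly.
y_naive = matmul(fake_quant_int8(x), fake_quant_int8(w))

# Smoothed W8A8: rebalance ranges first, then quantize both operands.
x_s, w_s = smooth(x, w)
y_smooth = matmul(fake_quant_int8(x_s), fake_quant_int8(w_s))

equiv_gap = mean_abs_diff(y_ref, matmul(x_s, w_s))  # floating-point noise only
err_naive = mean_abs_diff(y_ref, y_naive)
err_smooth = mean_abs_diff(y_ref, y_smooth)

print("equivalence gap before quantization:", equiv_gap)
print("mean abs error, naive W8A8:   ", err_naive)
print("mean abs error, smoothed W8A8:", err_smooth)
```

In this toy setup the outlier channel forces a coarse quantization step on every activation, which is why rebalancing the ranges before rounding lowers the error. Real deployments would use calibration data and integer matrix-multiply kernels rather than the floating-point simulation shown here.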
