QAQ: Query-adaptive Mixed-precision Quantization for Large Language Models
Shuxing Li · Huanrong Liu · Zelin Wang · Ruoyang Du · S Lee · Chunlin Tian · Qingbiao Li
Abstract
Large language models (LLMs) achieve strong performance, yet their inference remains constrained by a trade-off between efficiency and accuracy. Quantization cuts memory and latency, but fixed-precision schemes cannot flexibly accommodate heterogeneous inputs. We introduce Query-Adaptive Quantization (QAQ), a dynamic-precision scheme that decomposes model weights into bit-planes, employs a trainable router for query-conditioned precision selection, and supports on-demand CPU$\leftrightarrow$GPU loading. On Qwen3 and LLaMA-3.1, QAQ matches the accuracy of 8-bit baselines while reducing the GPU memory footprint, at the cost of some added latency. These results suggest that QAQ offers a practical operating point on the efficiency-accuracy frontier for LLM inference.
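The abstract does not give implementation details, so the following is a minimal sketch of the bit-plane idea under assumed conventions: symmetric per-tensor int8 quantization, one sign plane plus seven magnitude planes, and simple truncation when low-order planes are omitted. The function names (`quantize_int8`, `decompose_bitplanes`, `reconstruct`) and the precision-selection interface are illustrative assumptions, not the authors' API.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization (assumed scheme, not from the paper)."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def decompose_bitplanes(q: np.ndarray):
    """Split int8 weights into a sign plane and 7 magnitude bit-planes, MSB first."""
    sign = q < 0
    mag = np.abs(q.astype(np.int16)).astype(np.uint8)  # magnitudes fit in 7 bits
    planes = [((mag >> b) & 1) for b in range(6, -1, -1)]  # planes[0] is the MSB
    return sign, planes

def reconstruct(sign: np.ndarray, planes: list, scale: float, bits: int):
    """Rebuild weights from the top `bits` magnitude planes.

    Fewer planes means coarser (truncated) weights; using all 7 recovers
    full int8 fidelity. `bits` stands in for a router-selected precision.
    """
    mag = np.zeros_like(planes[0], dtype=np.int16)
    for i in range(min(bits, len(planes))):
        mag += planes[i].astype(np.int16) << (len(planes) - 1 - i)
    return np.where(sign, -mag, mag).astype(np.float32) * scale

# Usage: decompose once, then dequantize at a query-dependent precision.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
sign, planes = decompose_bitplanes(q)
w_coarse = reconstruct(sign, planes, scale, bits=3)  # easy query: top 3 planes
w_fine = reconstruct(sign, planes, scale, bits=7)    # hard query: full int8
```

In this framing, a trained router would map each query to a per-layer `bits` budget, and only the selected planes would need to be resident on the GPU, with the remaining low-order planes paged in from CPU memory on demand.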