Optimizing Memory vs. Accuracy in Reasoning Models Cannot Be Scale-Agnostic
Junhyuck Kim · Ethan Ewer · Taehong Moon · Jongho Park · Dimitris Papailiopoulos
Abstract
Current scaling laws for the precision-performance trade-off largely assess zero-shot accuracy and consider only model weight precision, offering limited guidance for modern reasoning models in which the KV cache may dominate memory. Conversely, existing test-time scaling work often focuses on FLOPs, failing to capture the practical benefits of model and cache compression. In this work, under a fixed memory budget, we jointly study the trade-offs between model scale, weight precision, KV cache compression, generation token budget, and sample size for parallel scaling. In contrast to the zero-shot setting, where 4-bit is memory-optimal across scales, we find that memory-optimal reasoning is scale-dependent. (i) For small models ($\le$4B), the optimal strategy across precisions is to increase model size rather than extend the token budget until saturation. (ii) Parallel scaling via majority voting is memory-efficient only for models effectively at least as large as a 4B model at 8-bit precision. (iii) Furthermore, optimal weight precision is also task-dependent: higher precision is critical for logical reasoning, while 4-bit remains effective for knowledge-intensive tasks. (iv) Finally, we demonstrate that while both KV cache quantization and eviction are effective across scales and precisions, eviction offers the better memory trade-off for small models. We believe our findings provide practitioners with better guidance on how to deploy reasoning models efficiently.
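To make the fixed-memory-budget framing concrete, the following is a minimal back-of-the-envelope sketch (not from the paper) of how total inference memory might be accounted for, assuming a standard transformer KV cache layout; all function and parameter names and the example values are illustrative assumptions, not the authors' methodology.

def total_memory_gb(
    n_params: float,             # model parameters (e.g., 4e9 for a 4B model)
    weight_bits: int,            # weight precision (e.g., 4, 8, 16)
    n_layers: int,               # transformer depth
    n_kv_heads: int,             # KV heads (after grouped-query attention)
    head_dim: int,               # per-head dimension
    seq_len: int,                # prompt + generated tokens per sample
    n_samples: int,              # parallel samples for majority voting
    kv_bits: int = 16,           # KV cache precision (quantization lowers this)
    kv_keep_ratio: float = 1.0,  # fraction of tokens kept after eviction
) -> float:
    """Rough memory footprint (weights + KV cache) in GB."""
    weight_bytes = n_params * weight_bits / 8
    # K and V each store n_layers * n_kv_heads * head_dim values per token,
    # for every token of every parallel sample.
    kv_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * seq_len * kv_keep_ratio * n_samples
        * kv_bits / 8
    )
    return (weight_bytes + kv_bytes) / 1e9

# Illustrative example: a hypothetical 4B model at 8-bit weights, with
# 8 parallel samples of 8,192 tokens each and a 16-bit KV cache.
print(total_memory_gb(4e9, 8, 36, 8, 128, 8192, 8))

Such an accounting makes explicit that, at long generation lengths or large sample counts, the KV cache term can dominate the weight term, which is why the trade-offs studied here couple model scale and precision with token budget, sample size, and cache compression.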