Efficiency for Reasoning and Reasoning for Efficiency
Abstract
There is an interesting tension between reasoning models and long-context tasks. The current state-of-the-art long-context models are reasoning models, yet for 8B-parameter reasoning models that struggle with intensive reasoning tasks, long chains-of-thought (CoTs) can strain the key-value (KV) cache and degrade performance on certain tasks. This paper asks: can efficient inference techniques make reasoning models more Pareto-optimal in memory, latency, and performance? We find that NF4 and Int8 weight quantization strongly outperform baselines and token eviction methods in Pareto efficiency for both reasoning and non-reasoning models. We also find that token eviction methods for KV cache compression struggle on tasks that rely heavily on in-context learning (ICL), passkey retrieval, long outputs, and long-context reasoning, because they tend to evict critical tokens. However, reasoning models can help recover the performance lost to token eviction on ICL tasks, achieving full performance even with smaller cache sizes and thus advancing the Pareto frontier. In short, efficient inference techniques can cut costs for reasoning models, and reasoning can improve performance for efficient inference techniques. Our findings underscore the importance of evaluating different types of efficient inference techniques in relation to each other. More broadly, our work raises the question: do the advantages of long CoT reasoning on long-context tasks hold for smaller reasoning models, or are long CoTs sometimes more burden than they are worth?