GEAR-X: Expanders for Next-Gen KV Cache Compression
Abstract
Large Reasoning Models (LRMs) use Key-Value (KV) caching to speed up autoregressive decoding by reusing the attention keys and values computed for previous tokens. However, the KV cache grows linearly with sequence length, quickly saturating GPU memory and becoming the bottleneck for long-context reasoning. Prior work such as GEAR (GEnerative Inference with Approximation Error Reduction) compresses the KV cache by combining low-bit quantization, sparse outlier handling, and low-rank approximation of the quantization residual. We propose GEAR-X, a drop-in modification that replaces GEAR's unstructured, magnitude-based outlier selection with structured sparsity derived from expander graphs. The expander structure provides spectral guarantees that preserve connectivity and information flow under aggressive compression, improving the fidelity of the compressed cache without any retraining. Preliminary experiments on the GSM8K, AQuA, and BBH benchmarks show that GEAR-X achieves accuracy competitive with or better than standard GEAR while maintaining its memory savings.
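To illustrate the selection mechanism at a glance, the sketch below contrasts an unstructured magnitude-based outlier mask (as in GEAR) with an expander-style structured mask in which each token is connected to a fixed number of channel groups. This is a minimal sketch under stated assumptions, not the GEAR-X implementation: the group count, degree `d`, `per_token` budget, and both function names are hypothetical choices made here for exposition.

```python
import numpy as np

def magnitude_outliers(x: np.ndarray, k: int) -> np.ndarray:
    """GEAR-style unstructured selection: boolean mask of the k
    largest-magnitude entries of a KV slice x (tokens x channels)."""
    mask = np.zeros(x.size, dtype=bool)
    mask[np.argsort(np.abs(x), axis=None)[-k:]] = True
    return mask.reshape(x.shape)

def expander_outliers(x: np.ndarray, d: int, per_token: int,
                      n_groups: int = 8, seed: int = 0) -> np.ndarray:
    """Hypothetical expander-style structured selection: each token
    (row) is linked to d randomly chosen channel groups, forming a
    random d-left-regular bipartite graph (an expander with high
    probability), and its per_token largest-magnitude entries within
    those groups are kept. The structure guarantees every token
    retains outliers, so no position is disconnected by compression."""
    rng = np.random.default_rng(seed)
    n_tokens, n_channels = x.shape
    groups = np.array_split(np.arange(n_channels), n_groups)
    mask = np.zeros_like(x, dtype=bool)
    for t in range(n_tokens):
        # Token t -> d distinct groups (the bipartite edges).
        chosen = rng.choice(n_groups, size=d, replace=False)
        cols = np.concatenate([groups[g] for g in chosen])
        # Keep the strongest entries, restricted to connected groups.
        top = cols[np.argsort(np.abs(x[t, cols]))[-per_token:]]
        mask[t, top] = True
    return mask

# Example: a 16-token x 64-channel cache slice, equal outlier budgets.
x = np.random.default_rng(1).standard_normal((16, 64))
m1 = magnitude_outliers(x, k=16 * 2)       # ~2 outliers/token on average
m2 = expander_outliers(x, d=2, per_token=2)
print(m1.sum(), m2.sum())                  # same budget, different structure
print((m1.sum(axis=1) == 0).any())         # magnitude mask may skip tokens
print((m2.sum(axis=1) == 0).any())         # expander mask never does
```

Both masks spend the same sparsity budget; the difference is where it lands. The unstructured mask can concentrate all outliers on a few dominant tokens, whereas the expander-style mask spreads coverage so that every token keeps at least some high-fidelity entries.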