SGDKV: Summarization-Guided KV Cache Compression
Abstract
Large language models face memory bottlenecks in long-context scenarios due to linearly growing key-value (KV) cache requirements. Existing KV cache compression methods often rely on simple heuristics, failing to discern the functional roles of different attention heads. We introduce SGDKV (Summarization-Guided KV Cache Compression), a head-aware framework that leverages a novel chunk-summarization diagnostic task to systematically identify and prioritize the attention heads specialized in hierarchical information aggregation. Experiments on Qwen2.5-7B-1M and Qwen3-32B models across various benchmarks show that SGDKV achieves state-of-the-art performance with contexts up to 1M tokens while reducing KV cache memory by up to 75\%. Our findings demonstrate that strategically allocating the KV cache budget based on the summarization score distribution of attention heads offers a superior efficiency-accuracy tradeoff for long-context inference.