Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Abstract
The quadratic cost of attention hinders the scalability of long-context LLMs, particularly in resource-constrained settings. While attention is often sparse, existing static sparse methods such as sliding windows or global tokens cannot adapt to content-dependent variations in attention. Dynamic approaches improve flexibility but still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and may prune tokens that remain contextually important, limiting accuracy across diverse tasks. To address this, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online without retraining. DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply a length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to the token level to produce importance scores that determine which token-level interactions are preserved. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy while reducing prefill latency by 25–45% and peak memory usage by 30–35%. Compared with representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6–18% relative gains) at comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs. An anonymous project repository is available at https://drive.google.com/drive/folders/1AVdQOfCqRPYNNBzcfiSw1r-lBpRKO9Uy?usp=sharing
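To make the chunk aggregation and upsampling steps concrete, the following minimal PyTorch sketch illustrates length-normalized chunk representations (sum of token embeddings divided by the square root of the chunk size, equivalent to the mean scaled by sqrt of the chunk size) and the broadcast of chunk-level similarity scores back to token level. It is not the paper's implementation: the chunk boundaries, single-head shapes, the helper names chunk_aggregate and token_importance, and the top-k selection rule are all illustrative assumptions.

    # Illustrative sketch only; boundaries, shapes, and the top-k rule are assumptions.
    import torch

    def chunk_aggregate(x: torch.Tensor, boundaries: list[int]) -> torch.Tensor:
        """Length-normalized chunk representations: sum of token embeddings
        divided by sqrt(chunk size), removing the bias of varying chunk lengths."""
        chunks, start = [], 0
        for end in boundaries:
            seg = x[start:end]                                  # (chunk_len, d)
            chunks.append(seg.sum(dim=0) / (end - start) ** 0.5)
            start = end
        return torch.stack(chunks)                              # (num_chunks, d)

    def token_importance(q: torch.Tensor, k: torch.Tensor,
                         boundaries: list[int]) -> torch.Tensor:
        """Chunk-level similarity scores upsampled (broadcast) to token level."""
        cq = chunk_aggregate(q, boundaries)
        ck = chunk_aggregate(k, boundaries)
        chunk_scores = cq @ ck.T                                # (num_chunks, num_chunks)
        sizes = torch.tensor([boundaries[0]] +
                             [b - a for a, b in zip(boundaries, boundaries[1:])])
        return chunk_scores.repeat_interleave(sizes, dim=0).repeat_interleave(sizes, dim=1)

    # Usage: keep only the highest-scoring token interactions (here, roughly the top 20%).
    q, k = torch.randn(16, 8), torch.randn(16, 8)
    scores = token_importance(q, k, boundaries=[4, 9, 16])
    threshold = scores.flatten().kthvalue(int(0.8 * scores.numel())).values
    mask = scores >= threshold                                  # token-level sparsity mask

Under these assumptions, the mask plays the role of the token-level sparsity pattern that decides which attention interactions are preserved; the actual method additionally learns where to place the variable-length chunk boundaries.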