Spotlight Poster
MInference: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang · Yucheng LI · Chengruidong Zhang · Qianhui Wu · Xufang Luo · Surin Ahn · Zhenhua Han · Amir Abdi · Dongsheng Li · Chin-Yew Lin · Yuqing Yang · Lili Qiu
East Exhibit Hall A-C #2401
The computational challenges of Large Language Model (LLM) inference remain a significant barrier to widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference, a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape pattern, the Vertical-Slash pattern, and the Block-Sparse pattern) that can be leveraged for efficient sparse computation on GPUs. During inference, we determine the optimal pattern for each attention head and dynamically build sparse indices based on the assigned pattern. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels, significantly reducing the latency of the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and on models including LLaMA-3-8B and Yi-9B-200k, we demonstrate that MInference reduces inference latency by up to 10× for pre-filling on an A100 GPU while maintaining accuracy.
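To make the three pattern names concrete, the following is a minimal NumPy sketch of what each sparsity pattern might look like as a boolean mask over a causal attention matrix. It is an illustration under stated assumptions, not the authors' GPU kernels: the function names, the parameters (`n_init`, `window`, `vertical_cols`, `slash_offsets`, `kept_blocks`), and the reading of A-shape as "initial tokens plus a local window" are all assumptions introduced here. The sketch also applies the mask to a dense score matrix, so it shows which entries are kept but does not itself save any computation.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular mask: query i may attend to keys 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def a_shape_mask(n, n_init=4, window=8):
    """Assumed A-shape: the first n_init key columns plus a local window."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    keep = (k < n_init) | (q - k < window)
    return keep & causal_mask(n)

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Assumed Vertical-Slash: chosen key columns plus chosen diagonals."""
    q = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    vertical = np.isin(k, vertical_cols)    # whole columns
    slash = np.isin(q - k, slash_offsets)   # diagonals at fixed offsets
    return (vertical | slash) & causal_mask(n)

def block_sparse_mask(n, block, kept_blocks):
    """Assumed Block-Sparse: keep only the listed (row_block, col_block) tiles."""
    mask = np.zeros((n, n), dtype=bool)
    for bi, bj in kept_blocks:
        mask[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block] = True
    return mask & causal_mask(n)

def masked_attention(q, k, v, mask):
    """Dense reference attention that ignores masked-out entries."""
    n = q.shape[0]
    # Always keep the diagonal so no row is fully masked (avoids NaN rows).
    mask = mask | np.eye(n, dtype=bool)
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def density(mask):
    """Fraction of causal entries that the mask keeps."""
    return mask.sum() / causal_mask(mask.shape[0]).sum()
```

For example, `density(a_shape_mask(1024, n_init=4, window=64))` reports the share of causal entries that this hypothetical A-shape mask would retain. A real sparse kernel would skip the masked-out entries rather than compute and discard them, which is where any latency reduction would come from.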