

Poster

Loki: Low-rank Keys for Efficient Sparse Attention

Prajwal Singhania · Siddharth Singh · Shwai He · Soheil Feizi · Abhinav Bhatele

East Exhibit Hall A-C #2000
Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked an interest in approximating the self-attention computation to reduce such costs. In this work, we propose to approximate self-attention by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods.
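
The sketch below illustrates the general idea described in the abstract, not the authors' implementation: token scores are computed cheaply with keys and queries projected into a low-dimensional space (here a placeholder orthonormal projection stands in for a PCA-derived one), the top-k tokens in the KV-cache are selected by those approximate scores, and exact attention is then computed only over the selected subset. All function and variable names are illustrative assumptions.

```python
# Minimal sketch of low-rank top-k sparse attention (assumed, not the paper's code).
import torch
import torch.nn.functional as F

def lowrank_sparse_attention(q, K, V, P, k):
    """
    q: (d,)    current query vector
    K: (n, d)  cached key vectors
    V: (n, d)  cached value vectors
    P: (d, r)  low-rank projection with r << d (e.g., top principal components of keys)
    k: int     number of KV-cache tokens to keep
    """
    d = q.shape[-1]
    # Rank tokens cheaply in the r-dimensional space.
    q_low = q @ P                                   # (r,)
    K_low = K @ P                                   # (n, r)
    approx_scores = K_low @ q_low                   # (n,)
    top_idx = torch.topk(approx_scores, min(k, K.shape[0])).indices
    # Exact attention over the selected subset only.
    scores = (K[top_idx] @ q) / d ** 0.5            # (k,)
    weights = F.softmax(scores, dim=-1)             # (k,)
    return weights @ V[top_idx]                     # (d,)

# Toy usage with random tensors; a random orthonormal matrix stands in for a
# projection learned from calibration data (an assumption for illustration).
torch.manual_seed(0)
n, d, r = 1024, 128, 32
K, V, q = torch.randn(n, d), torch.randn(n, d), torch.randn(d)
P = torch.linalg.qr(torch.randn(d, r)).Q
out = lowrank_sparse_attention(q, K, V, P, k=256)
print(out.shape)  # torch.Size([128])
```

The intended benefit is that ranking costs O(n·r) instead of O(n·d), and the exact attention step loads only k of the n cached key/value vectors, which is where the reduced data movement comes from.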
