

Poster

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Songlin Yang · Bailin Wang · Yu Zhang · Yikang Shen · Yoon Kim

East Exhibit Hall A-C #2009
Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Transformers with linear attention (i.e., linear Transformers) and state-space models have recently been suggested as viable linear-time alternatives to Transformers with softmax attention. However, these models still underperform Transformers, especially on recall-intensive tasks. While more expressive variants of linear Transformers that replace the additive outer-product update with the delta rule have been found to be more effective at associative recall, existing algorithms for training such models are hardware-inefficient and thus difficult to scale. This work describes a hardware-efficient algorithm for training a generalized variant of linear Transformers (of which DeltaNet is a special case); the algorithm exploits the WY representation for computing products of Householder matrices. This allows us to scale DeltaNet to moderate-scale language modeling settings (1.3B models trained for 100B tokens), where we find that it outperforms strong linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks (including tasks that focus on recall).
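The two ingredients named in the abstract can be illustrated with a minimal, sequential sketch: the delta-rule state update that replaces the additive outer-product update, and the WY identity that summarizes a chunk of such updates with a pair of matrices. The PyTorch snippet below assumes a single head with queries/keys/values of shape (T, d) and per-token writing strengths beta_t; the tensor names and per-token loops are illustrative reference code, not the paper's chunkwise-parallel, hardware-efficient kernel.

```python
import torch

def delta_rule_recurrence(q, k, v, beta):
    """Sequential reference for the delta-rule update,
    S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T,
    equivalently S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T."""
    T, d = k.shape
    S = torch.zeros(d, d)                      # state mapping keys to values
    out = torch.empty(T, d)
    for t in range(T):
        v_pred = S @ k[t]                      # value currently stored under k_t
        S = S + beta[t] * torch.outer(v[t] - v_pred, k[t])  # correct it toward v_t
        out[t] = S @ q[t]                      # read out with the query
    return out

def chunk_transition_wy(k, beta):
    """WY form of a chunk's homogeneous transition
    P = (I - beta_1 k_1 k_1^T) ... (I - beta_T k_T k_T^T) = I - U^T K,
    built one row of U at a time (the rows of k are the k_t^T)."""
    T, d = k.shape
    U = torch.zeros(T, d)
    for t in range(T):
        U[t] = beta[t] * (k[t] - U[:t].T @ (k[:t] @ k[t]))
    return torch.eye(d) - U.T @ k

if __name__ == "__main__":
    T, d = 16, 8
    k = torch.nn.functional.normalize(torch.randn(T, d), dim=-1)
    q, v = torch.randn(T, d), torch.randn(T, d)
    beta = torch.sigmoid(torch.randn(T))

    out = delta_rule_recurrence(q, k, v, beta)          # shape (T, d)

    # Check the WY identity against the naive ordered product of the transforms.
    P_naive = torch.eye(d)
    for t in range(T):
        P_naive = P_naive @ (torch.eye(d) - beta[t] * torch.outer(k[t], k[t]))
    assert torch.allclose(chunk_transition_wy(k, beta), P_naive, atol=1e-5)
    print(out.shape)
```

The final check shows why the WY representation matters: a whole chunk of per-token Householder-like transforms collapses to I minus a product of two small matrices, which is what makes a chunkwise, matmul-heavy (and hence hardware-friendly) formulation of training possible.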
