Poster
Gated Slot Attention for Efficient Linear-Time Sequence Modeling
Yu Zhang · Songlin Yang · Rui-Jie Zhu · Yue Zhang · Leyang Cui · Yiqiao Wang · Bolun Wang · Freda Shi · Bailin Wang · Wei Bi · Peng Zhou · Guohong Fu
East Exhibit Hall A-C #1801
Large language models (LLMs) with standard attention suffer from quadratic training complexity and memory-intensive key-value (KV) cache management during inference. Linear attention has emerged as a promising alternative, replacing unbounded KV memories with fixed-capacity hidden states. However, its implementations often fall short of the performance of Llama-like architectures. In this paper, we introduce Gated Slot Attention (GSA), an approach for linearized sequence modeling that combines the principles of standard attention with data-dependent gated linear attention. Thanks to its improved memory management, LLMs trained with GSA are expected to outperform those built on previous linear attention designs. More importantly, GSA can be efficiently parallelized on modern hardware, enabling large-scale experiments. We verify the effectiveness and efficiency of GSA by training models with 1.3B and 2.7B parameters from scratch, and the experimental results demonstrate its competitive performance on a wide range of benchmarks. Furthermore, we present preliminary results on continual training from the Mistral 7B checkpoint, showing that GSA exhibits better compatibility with existing well-trained standard attention-based LLMs than other linear attention variants.
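To make the idea of a fixed-capacity, gated memory concrete, below is a minimal sketch of one recurrent step over a bounded set of memory slots, in the spirit of the abstract's description (data-dependent gates decay the slot memory, a softmax over slots reads it out). This is an illustrative simplification, not the paper's exact equations: the function names, shapes, and the precise gating/readout rules are assumptions.

import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def gated_slot_attention_step(K_mem, V_mem, q_t, k_t, v_t, alpha_t):
    """One recurrent step over a fixed-capacity slot memory (illustrative).

    K_mem, V_mem : (m, d) slot key / value memories
    q_t, k_t, v_t: (d,)   current query / key / value
    alpha_t      : (m,)   data-dependent forget gate in (0, 1), one per slot
    """
    # Write: decay each slot by its gate and blend in the new key/value,
    # so the memory stays at m slots regardless of sequence length.
    K_mem = alpha_t[:, None] * K_mem + np.outer(1.0 - alpha_t, k_t)
    V_mem = alpha_t[:, None] * V_mem + np.outer(1.0 - alpha_t, v_t)
    # Read: softmax attention of the query over the m slots (constant cost per token).
    attn = softmax(K_mem @ q_t)          # (m,)
    o_t = attn @ V_mem                   # (d,)
    return K_mem, V_mem, o_t

# Toy usage: m = 8 slots, head dimension d = 16, memory size is fixed per step.
m, d = 8, 16
K_mem, V_mem = np.zeros((m, d)), np.zeros((m, d))
q_t, k_t, v_t = (np.random.randn(d) for _ in range(3))
alpha_t = 1.0 / (1.0 + np.exp(-np.random.randn(m)))   # gates in (0, 1)
K_mem, V_mem, o_t = gated_slot_attention_step(K_mem, V_mem, q_t, k_t, v_t, alpha_t)

In contrast to a standard KV cache, the per-token cost here does not grow with context length; the chunkwise parallel training formulation used in practice is omitted from this sketch.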