

Poster

The Mamba in the Llama: Distilling and Accelerating Hybrid Models

Junxiong Wang · Daniele Paliotta · Avner May · Alexander Rush · Tri Dao

East Exhibit Hall A-C #2103
Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Recent research suggests that state-space models (SSMs) like Mamba can match Transformer models at language modeling while offering advantageous deployment characteristics. Given the community's focus and expertise in training large-scale Transformer models, we consider the challenge of converting these pretrained models into SSMs for deployment. We demonstrate that it is feasible, with academic GPU resources, to distill large Transformers into SSMs by reusing the linear projection weights from their attention layers. The resulting hybrid model, which retains a quarter of the attention layers, achieves performance comparable to the original Transformer. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates inference for state-space models. Overall, we show how, with limited computational resources, we can distill a large Transformer into a hybrid SSM and decode it efficiently.
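To make the weight-reuse idea concrete, the sketch below shows one plausible way to initialize a linear-RNN / Mamba-style layer from a pretrained attention block's linear projections. The module and attribute names (q_proj, k_proj, v_proj, o_proj, B_proj, C_proj, x_proj, out_proj) and the specific Q/K/V-to-C/B/x correspondence are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: seed an SSM layer's projections from attention weights.
import torch
import torch.nn as nn


def init_ssm_from_attention(attn: nn.Module, ssm: nn.Module) -> None:
    """Copy attention projection weights into corresponding SSM projections.

    Assumed mapping (one plausible choice):
      K -> B (input-to-state), Q -> C (state-to-output),
      V -> x (input/value path), O -> output projection.
    """
    with torch.no_grad():
        ssm.B_proj.weight.copy_(attn.k_proj.weight)
        ssm.C_proj.weight.copy_(attn.q_proj.weight)
        ssm.x_proj.weight.copy_(attn.v_proj.weight)
        ssm.out_proj.weight.copy_(attn.o_proj.weight)
```

For the decoding side, the following is a generic, greedy-verification sketch of speculative decoding, not the paper's hardware-aware algorithm: a small draft model proposes k tokens, the large target model verifies them in a single forward pass, and the longest agreeing prefix is kept. `draft_model` and `target_model` are assumed to map a token sequence of shape [1, T] to per-position logits; batch size 1 is assumed.

```python
import torch


@torch.no_grad()
def speculative_step(target_model, draft_model, tokens: torch.Tensor, k: int = 4):
    """One draft-and-verify step; returns the extended token sequence."""
    # 1) Draft k tokens greedily with the cheap model.
    drafted = tokens
    for _ in range(k):
        next_tok = draft_model(drafted)[:, -1].argmax(-1, keepdim=True)
        drafted = torch.cat([drafted, next_tok], dim=-1)

    # 2) Score all drafted positions with one target-model forward pass.
    logits = target_model(drafted)                     # [1, T + k, V]
    target_pred = logits[:, -k - 1:].argmax(-1)        # target's choice for each
                                                       # drafted slot, plus one bonus token
    proposal = drafted[:, -k:]

    # 3) Accept the longest prefix where draft and target agree, then append
    #    the target's token at the first disagreement (or its bonus token).
    match = (target_pred[:, :k] == proposal).long().cumprod(-1)
    n_accept = int(match.sum())
    new_tokens = torch.cat(
        [proposal[:, :n_accept], target_pred[:, n_accept:n_accept + 1]], dim=-1
    )
    return torch.cat([tokens, new_tokens], dim=-1)
```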
