E-Tamba: Efficient Transformer-Mamba Layer Transplantation
Abstract
With the growing popularity of Transformers and State Space Models (SSMs), hybrid designs such as Jamba and RecurrentGemma have attracted significant attention for their ability to combine the long-context processing strengths of Transformers with the low memory footprint of SSMs. However, most hybrid models require extensive pre-training, putting them out of reach for researchers with limited resources who want to experiment with different model architectures. To address this challenge, we introduce E-Tamba, a novel method for constructing hybrid models solely by fine-tuning pre-trained Transformer and SSM models. Guided by layer-wise importance analysis, E-Tamba-1.1B replaces the non-critical upper Transformer layers of Pythia-1.4B with key layers from Mamba-1.4B. After fine-tuning on only 0.9B tokens, E-Tamba-1.1B achieves strong perplexity and competitive results across a range of downstream NLP tasks. It also reduces inference memory by 3× relative to the Pythia-1.4B baseline, while offering superior long-context retrieval capabilities over Mamba-1.4B.
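To make the transplantation idea concrete, the sketch below illustrates importance-guided layer replacement in plain PyTorch. It is a minimal, hypothetical illustration, not the paper's implementation: the importance proxy (cosine similarity between a layer's input and output hidden states), the toy host/donor modules, and the function names are all assumptions introduced here for exposition.

```python
# Hypothetical sketch of importance-guided layer transplantation.
# Assumptions (not from the paper): the importance proxy, the toy modules,
# and the choice to restrict replacement to the upper half of the host stack.
import torch
import torch.nn as nn


def layer_importance(layers: nn.ModuleList, hidden: torch.Tensor) -> list[float]:
    """Score each layer by how much it changes its input.

    A layer whose output stays close to its input (high cosine similarity)
    is treated as less important; this is one plausible proxy, not
    necessarily the analysis used by E-Tamba.
    """
    scores = []
    with torch.no_grad():
        for layer in layers:
            out = layer(hidden)
            cos = nn.functional.cosine_similarity(
                hidden.flatten(1), out.flatten(1), dim=-1
            ).mean()
            scores.append(1.0 - cos.item())  # small change -> low importance
            hidden = out
    return scores


def transplant(host_layers: nn.ModuleList,
               donor_layers: nn.ModuleList,
               importance: list[float],
               n_replace: int) -> nn.ModuleList:
    """Swap the n_replace least-important upper host layers for donor layers."""
    # E-Tamba targets non-critical *upper* layers, so only the upper half
    # of the host stack is considered for replacement here.
    upper = range(len(host_layers) // 2, len(host_layers))
    victims = sorted(upper, key=lambda i: importance[i])[:n_replace]
    for slot, donor in zip(sorted(victims), donor_layers[:n_replace]):
        host_layers[slot] = donor
    return host_layers


if __name__ == "__main__":
    # Toy stand-ins; in practice these would be Pythia-1.4B Transformer
    # blocks (host) and Mamba-1.4B SSM blocks (donor).
    d = 64
    host = nn.ModuleList(
        nn.TransformerEncoderLayer(d, nhead=4, batch_first=True) for _ in range(8)
    )
    donor = nn.ModuleList(
        nn.Sequential(nn.Linear(d, d), nn.SiLU(), nn.Linear(d, d)) for _ in range(8)
    )

    x = torch.randn(2, 16, d)            # calibration batch of hidden states
    imp = layer_importance(host, x)
    hybrid = transplant(host, donor, imp, n_replace=3)
    print([type(m).__name__ for m in hybrid])
```

In a real pipeline, the resulting hybrid stack would then be fine-tuned end to end (0.9B tokens in the paper) so the transplanted SSM layers adapt to the host model's representations.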