Poster
MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
Xingkui Zhu · Yiran Guan · Dingkang Liang · Yuchao Chen · Yuliang Liu · Xiang Bai
East Exhibit Hall A-C #1310
The sparsely activated mixture of experts (MoE) model presents an effective alternative to densely activated (dense) models, combining improved accuracy with computational efficiency. However, training MoE models from scratch requires extensive data and computational resources, a challenge that limits their widespread adoption. To address this, we introduce MoE Jetpack, a framework designed to fine-tune abundant, easily accessible dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which initializes MoE models with dense checkpoints to accelerate convergence and enhance accuracy, minimizing the need for extensive pre-training; (2) the hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture to improve fine-tuning performance and efficiency. Experimental results indicate that MoE Jetpack doubles the convergence speed and enhances accuracy by 2.8% on ImageNet-1K. On smaller datasets, it achieves up to 8-fold faster convergence and over 30% accuracy gains, highlighting its efficiency. The code is available at https://github.com/Adlith/MoE-Jetpack.
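To make the checkpoint-recycling idea concrete, the minimal sketch below shows one plausible way to seed MoE experts from the MLP weights of a dense Transformer block. This is an illustrative assumption, not the MoE Jetpack implementation: the function recycle_dense_mlp, its parameters, and the subset-copying strategy are hypothetical; see the official repository for the actual method.

```python
# Hypothetical sketch of "checkpoint recycling": seeding MoE experts from a
# dense checkpoint's MLP weights. One simple variant is shown, where each
# expert copies a random subset of the dense hidden units.
import torch
import torch.nn as nn


def recycle_dense_mlp(dense_fc1: nn.Linear, dense_fc2: nn.Linear,
                      num_experts: int, expert_hidden: int) -> nn.ModuleList:
    """Build `num_experts` smaller expert MLPs from one dense MLP."""
    experts = []
    for _ in range(num_experts):
        # Sample a subset of the dense hidden dimension for this expert.
        idx = torch.randperm(dense_fc1.out_features)[:expert_hidden]
        fc1 = nn.Linear(dense_fc1.in_features, expert_hidden)
        fc2 = nn.Linear(expert_hidden, dense_fc2.out_features)
        with torch.no_grad():
            # Copy the selected rows/columns of the dense weights so each
            # expert starts from pre-trained parameters rather than random init.
            fc1.weight.copy_(dense_fc1.weight[idx])
            fc1.bias.copy_(dense_fc1.bias[idx])
            fc2.weight.copy_(dense_fc2.weight[:, idx])
            fc2.bias.copy_(dense_fc2.bias)
        experts.append(nn.Sequential(fc1, nn.GELU(), fc2))
    return nn.ModuleList(experts)


# Example: recycle a ViT-Base block's MLP (768 -> 3072 -> 768) into 8 experts.
dense_fc1, dense_fc2 = nn.Linear(768, 3072), nn.Linear(3072, 768)
experts = recycle_dense_mlp(dense_fc1, dense_fc2, num_experts=8, expert_hidden=768)
```

The point of such an initialization is that every expert inherits pre-trained features from the dense checkpoint, so the MoE model starts close to the dense model's solution and converges far faster than training from scratch.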