Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien · Charles-Étienne Joseph · Zain Sarwar · Ashwinee Panda · Anirban Das · Shi-Xiong Zhang · Stephen Rawls · Sambit Sahu · Eugene Belilovsky · Irina Rish
Abstract
Prior work has shown that a simple combination of replay, learning rate re-warming, and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) do the MoE transformer's routers exacerbate forgetting relative to a dense model?; 2) do the routers maintain a balanced load on previous distributions after CPT?; 3) are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs? In what follows, we conduct a large-scale ($>2$B-parameter Switch and DeepSeek MoE LLMs trained for $600$B tokens) empirical study across four MoE transformers to answer these questions. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
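The CPT recipe referenced above hinges on learning rate re-warming and re-decaying at the start of each new pre-training phase. As an illustration only, a minimal sketch of such a schedule is given below; the specific warmup length, step count, and learning rate values are assumptions for the example, not hyperparameters taken from the paper.

```python
import math

def cpt_lr(step, warmup_steps=2000, total_steps=100_000,
           peak_lr=3e-4, min_lr=3e-5):
    """Learning rate for one continual pre-training phase.

    Linearly re-warms from 0 to peak_lr, then re-decays to min_lr
    with a cosine schedule. All hyperparameter values here are
    illustrative assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Re-warming: ramp the learning rate back up on the new distribution.
        return peak_lr * step / warmup_steps
    # Re-decaying: cosine anneal over the remainder of the phase.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```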