MoE-GPS: Guidelines for Prediction Strategy with Expert Duplication in MoE Load Balancing
Haiyue Ma · Zhixu Du · Yiran Chen
Abstract
In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance because token assignments vary across experts. Recent methods address this by duplicating popular experts on additional GPUs, which requires accurately predicting token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and overall system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies the optimal predictor design for a given system setting. Our results highlight Distribution-Only Prediction, which predicts the coarse token distribution across experts at much lower overhead than Token-to-Expert Prediction and achieves 23% faster inference for Mixtral 8×7B on the MMLU dataset.
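A minimal sketch of the idea behind Distribution-Only Prediction, assuming hypothetical names and a stand-in predictor (this is not the authors' code): a coarse per-expert load estimate, rather than a per-token expert label, is already enough to decide which popular experts to duplicate onto spare GPU slots before routing.

```python
import numpy as np

NUM_EXPERTS = 8

def predict_distribution(hidden_states: np.ndarray) -> np.ndarray:
    """Distribution-Only Prediction (sketch): estimate the fraction of tokens
    each expert will receive, without labeling individual tokens.
    A random softmax stands in for a cheap learned predictor here."""
    logits = np.random.randn(NUM_EXPERTS)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def choose_duplicates(expected_load: np.ndarray, spare_slots: int) -> list[int]:
    """Duplicate the experts with the highest expected load onto spare GPU
    slots so that the post-routing load is closer to balanced."""
    return list(np.argsort(expected_load)[::-1][:spare_slots])

# Usage: only the coarse histogram is needed to pick which experts to copy;
# Token-to-Expert Prediction would instead predict one expert per token.
expected_load = predict_distribution(hidden_states=np.zeros((1024, 4096)))
print("duplicate experts:", choose_duplicates(expected_load, spare_slots=2))
```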