MoE-GPS: Guidelines for Prediction Strategy with Expert Duplication in MoE Load Balancing
Haiyue Ma · Zhixu Du · Yiran Chen
Abstract
In multi-GPU Mixture-of-Experts (MoE) networks, distributing experts across GPUs leads to load imbalance because token assignments vary across experts. Recent methods address this by duplicating popular experts on additional GPUs, which requires accurately predicting token distributions before routing. This paper examines the tradeoffs between prediction strategy, accuracy, overhead, and overall system performance. We introduce MoE-GPS, a framework that quantifies these impacts and identifies the optimal predictor design for a given system setting. Our results highlight Distribution-Only Prediction, which predicts the coarse token distribution across experts at much lower overhead than Token-to-Expert Prediction and achieves 23% faster inference for Mixtral 8×7B on the MMLU dataset.
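A minimal sketch of the idea behind Distribution-Only Prediction, assuming hypothetical names and a stand-in predictor (this is not the authors' code): a coarse per-expert load estimate, rather than a per-token expert label, is already enough to decide which popular experts to duplicate onto spare GPU slots before routing.

```python
import numpy as np

NUM_EXPERTS = 8

def predict_distribution(hidden_states: np.ndarray) -> np.ndarray:
    """Distribution-Only Prediction (sketch): estimate the fraction of tokens
    each expert will receive, without labeling individual tokens.
    A random softmax stands in for a cheap learned predictor here."""
    logits = np.random.randn(NUM_EXPERTS)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def choose_duplicates(expected_load: np.ndarray, spare_slots: int) -> list[int]:
    """Duplicate the experts with the highest expected load onto spare GPU
    slots so that the post-routing load is closer to balanced."""
    return list(np.argsort(expected_load)[::-1][:spare_slots])

# Usage: only the coarse histogram is needed to pick which experts to copy;
# Token-to-Expert Prediction would instead predict one expert per token.
expected_load = predict_distribution(hidden_states=np.zeros((1024, 4096)))
print("duplicate experts:", choose_duplicates(expected_load, spare_slots=2))
```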