Recycling the World Computer: Fault-Tolerant LLM Training on Idle GPU Capacity
Jason Mancuso
Abstract
Modal runs a global, multi-cloud fleet of worker instances to provide our users sub-second access to autoscaled, GPU-backed compute. Fulfilling this commitment requires maintaining some buffer capacity to absorb bursty workloads. We present early results from leveraging that buffer capacity for distributed training of large language models with a fault-tolerant implementation of DiLoCo.
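For readers unfamiliar with DiLoCo, the sketch below illustrates the core idea of its outer/inner optimization loop: each worker takes many local optimizer steps on its own data shard, and the resulting parameter deltas (pseudo-gradients) are then averaged and applied by an outer optimizer. This is a minimal, self-contained toy in PyTorch; the model, data, step count, and learning rates are illustrative assumptions, the workers are simulated serially in one process, and the fault-tolerance machinery referred to in the abstract is omitted.

```python
import copy
import torch

H = 50  # inner steps each worker takes before synchronizing (illustrative value)

def diloco_round(global_model, worker_data, outer_opt):
    """One DiLoCo outer round: local training per worker, then an averaged outer step."""
    deltas = []
    for data, target in worker_data:  # one entry per worker, simulated serially
        local = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(local.parameters(), lr=1e-3)
        for _ in range(H):  # inner optimization on this worker's shard
            inner_opt.zero_grad()
            loss = torch.nn.functional.mse_loss(local(data), target)
            loss.backward()
            inner_opt.step()
        # Pseudo-gradient: how far this worker moved away from the global weights.
        deltas.append([g.detach() - l.detach()
                       for g, l in zip(global_model.parameters(), local.parameters())])

    # Outer step: apply the worker-averaged pseudo-gradient with the outer optimizer.
    outer_opt.zero_grad()
    for i, p in enumerate(global_model.parameters()):
        p.grad = torch.stack([d[i] for d in deltas]).mean(dim=0)
    outer_opt.step()

# Toy usage: a linear model, Nesterov-momentum outer optimizer, four synthetic workers.
model = torch.nn.Linear(8, 1)
outer_opt = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
workers = [(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(4)]
diloco_round(model, workers, outer_opt)
```

Because workers only exchange parameters every H inner steps rather than every gradient step, communication is infrequent enough to tolerate the loose coupling of a geographically scattered, preemptible fleet.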