Poster in Workshop on Advancing Neural Network Training (WANT): Computational Efficiency, Scalability, and Resource Optimization
A Quadratic Synchronization Rule for Distributed Deep Learning
Xinran Gu · Kaifeng Lyu · Sanjeev Arora · Jingzhao Zhang · Longbo Huang
Abstract:
In distributed deep learning with data parallelism, synchronizing gradients at every training step can cause a huge communication overhead, especially when many nodes work together to train large models. Local gradient methods, such as Local SGD, address this issue by allowing workers to compute locally for $H$ steps without synchronizing with others, hence reducing communication frequency. While $H$ has been viewed as a hyperparameter to trade optimization efficiency for communication cost, recent research indicates that setting a proper $H$ value can lead to generalization improvement. Yet, selecting a proper $H$ is elusive. This work proposes a theory-grounded method for determining $H$, named the Quadratic Synchronization Rule (QSR), which recommends dynamically setting $H$ in proportion to $\frac{1}{\eta^2}$ as the learning rate $\eta$ decays over time. Extensive ImageNet experiments on ResNet and ViT show that local gradient methods with QSR consistently improve the test accuracy over other synchronization strategies. Compared with standard data parallel training, QSR enables Local AdamW to cut the training time on 16 or 64 GPUs down from 26.7 to 20.2 hours or from 8.6 to 5.5 hours, while achieving higher top-1 validation accuracy.
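To make the scheduling idea concrete, the sketch below shows how a synchronization interval could be scaled quadratically with the inverse learning rate, as the abstract describes. This is a minimal illustration under assumed conventions, not the authors' implementation: the names `qsr_sync_interval`, `base_H`, `base_lr`, and the cap `H_max` are hypothetical choices for the example only, and the exact constants in the paper's rule may differ.

```python
def qsr_sync_interval(lr: float, base_lr: float, base_H: int = 4, H_max: int = 128) -> int:
    """Sketch of a quadratic synchronization rule: the number of local steps H
    between synchronizations grows like 1 / lr^2 as the learning rate decays,
    relative to a base interval base_H at the base learning rate base_lr.
    (Illustrative only; constants and clamping are assumptions, not the paper's.)"""
    H = base_H * (base_lr / lr) ** 2            # H proportional to 1 / eta^2
    return max(base_H, min(H_max, int(H)))      # clamp to a sensible range

# Example: as lr decays 0.1 -> 0.05 -> 0.025, H grows 4 -> 16 -> 64 local steps,
# so workers communicate less often late in training when the learning rate is small.
for lr in (0.1, 0.05, 0.025):
    print(lr, qsr_sync_interval(lr, base_lr=0.1))
```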