Scaling Deep Learning Optimization: Insights into Efficiency, Preconditioning, and Critical Batch Sizes
Sham Kakade
2024 Invited Talk
in
Workshop: Mathematics of Modern Machine Learning (M3L)
Abstract
Optimizing large-scale language models efficiently is critical as model sizes grow. This talk synthesizes insights from recent work on optimizer design, preconditioning, and critical batch size scaling. We compare widely used optimizers, revealing that practical considerations often outweigh performance differences, and highlight specific directions for improvement. Additionally, we establish new theoretical connections for Shampoo's preconditioner and introduce SOAP, a hybrid method combining Shampoo's efficiency with Adam's simplicity, reducing wall-clock time significantly. Finally, we investigate how critical batch size scales with data, providing actionable insights for parallelism in large-scale training.
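To make the Shampoo/Adam hybrid idea concrete, below is a minimal, hypothetical NumPy sketch of the kind of update the abstract alludes to: Shampoo-style Kronecker factor statistics of the gradient define an eigenbasis, and an Adam-style update is performed in that rotated space before being mapped back. This is an illustration of the general idea only, not the speaker's or the SOAP paper's reference implementation; the function name `soap_like_step`, the state layout, and all hyperparameters are assumptions, and details such as bias correction and infrequent eigenbasis refreshes are omitted.

```python
import numpy as np

def init_state(n_rows, n_cols):
    """Per-parameter state for one weight matrix (hypothetical layout)."""
    return {
        "L": np.zeros((n_rows, n_rows)),  # Shampoo-style left factor  ~ sum of G G^T
        "R": np.zeros((n_cols, n_cols)),  # Shampoo-style right factor ~ sum of G^T G
        "m": np.zeros((n_rows, n_cols)),  # Adam first moment, kept in the rotated basis
        "v": np.zeros((n_rows, n_cols)),  # Adam second moment, kept in the rotated basis
    }

def soap_like_step(G, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update for a weight matrix with gradient G."""
    # Accumulate Kronecker-factored second-moment statistics of the gradient.
    state["L"] = beta2 * state["L"] + (1.0 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1.0 - beta2) * (G.T @ G)

    # Eigenbases of the two factors (in practice recomputed only occasionally).
    QL = np.linalg.eigh(state["L"])[1]
    QR = np.linalg.eigh(state["R"])[1]

    # Rotate the gradient into the factored eigenbasis.
    G_rot = QL.T @ G @ QR

    # Standard Adam-style moments and update, maintained in the rotated space.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * G_rot
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * G_rot**2
    update_rot = state["m"] / (np.sqrt(state["v"]) + eps)

    # Rotate the update back to the original parameter space.
    return -lr * (QL @ update_rot @ QR.T)
```

The intended takeaway is the division of labor: the Kronecker factors supply a slowly changing basis (the expensive piece, amortized over many steps), while the cheap per-coordinate Adam update runs every step inside that basis.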