Invited Talk
Workshop on Advancing Neural Network Training (WANT): Computational Efficiency, Scalability, and Resource Optimization

Efficient LLM Training and Inference on GPUs

Mohammad Shoeybi · Bryan Catanzaro

Sat 16 Dec 1:30 p.m. PST — 2 p.m. PST

Abstract:

Training and inference of large transformer models are among the most important computational challenges of modern AI. Systems for training these models must be highly scalable and run at extreme efficiency, because the amount of work needed to converge a model can be extraordinarily large. Inference needs to be fast and must accommodate different query sizes. In this talk, I will discuss the work we have been doing at NVIDIA to optimize systems for Large Language Model training and inference on GPUs. I will present the different parallelism techniques we use in our LLM framework, Megatron-LM, and discuss how these techniques can be combined to maximize the training throughput of large models while retaining strict optimizer semantics. I will also discuss optimization techniques for inference, including methods to accelerate inference and reduce memory fragmentation.
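As a rough illustration of how the parallelism dimensions mentioned above combine, the sketch below shows the standard accounting used in Megatron-LM-style 3D parallelism: the total GPU count factors into tensor-, pipeline-, and data-parallel sizes. The function name and the example numbers are illustrative assumptions, not Megatron-LM's actual API.

# Minimal sketch (hypothetical helper, not Megatron-LM code): how tensor-,
# pipeline-, and data-parallel sizes jointly cover a GPU cluster.

def parallel_layout(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> dict:
    """Split world_size GPUs into tensor-, pipeline-, and data-parallel groups."""
    model_parallel = tensor_parallel * pipeline_parallel
    assert world_size % model_parallel == 0, \
        "world size must be divisible by tensor_parallel * pipeline_parallel"
    data_parallel = world_size // model_parallel
    return {
        "tensor_parallel": tensor_parallel,      # splits each layer's matmuls across GPUs
        "pipeline_parallel": pipeline_parallel,  # splits the layer stack into stages
        "data_parallel": data_parallel,          # replicates the resulting model shards
    }

if __name__ == "__main__":
    # Illustrative example: 3072 GPUs with 8-way tensor and 12-way pipeline
    # parallelism leaves 32-way data parallelism (3072 = 8 * 12 * 32).
    print(parallel_layout(world_size=3072, tensor_parallel=8, pipeline_parallel=12))

The design point the talk refers to is that these three axes trade off differently (intra-layer communication, pipeline bubbles, gradient all-reduce volume), so the degrees are tuned jointly to maximize throughput without changing optimizer semantics.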

Speaker's Bio: Dr. Mohammad Shoeybi is the Director of Applied Research at NVIDIA. His team focuses on building large foundation models and applying them to downstream applications. His team built Megatron-LM, a framework for efficiently training LLMs, and used it to train several large-scale models such as Megatron-Turing NLG with 530 billion parameters. He received his Ph.D. from Stanford University in 2010. Prior to NVIDIA, he worked at DeepMind and Baidu USA, leading efforts to bring deep learning and reinforcement learning to applications.
