Poster
Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
Hanxiao Zhang · Lin JU · Chan Wu · Jinjing Huang · Youshao Xiao · Zhenglei Zhou · Zhiming fan · Zhaoxin Huan · Siyuan Li · Fanzhuang Meng · Lei Liang · Xiaolu Zhang · Jun Zhou
West Ballroom A-D #6108
Recently, various strategies for distributed training of large language models (LLMs) have been proposed. By categorizing them into basic strategies and composite strategies, we have discovered that existing basic strategies provide limited options in specific scenarios, leaving considerable room for optimization in training speed. In this paper, we rethink the impact of memory and communication costs on the training speed of LLMs, accounting for the disparity between intra- and inter-group communication performance, and then propose a new set of basic strategies named the Partial Redundancy Optimizer (PaRO). PaRO Data Parallelism (PaRO-DP) accelerates LLM training through refined model state partitioning and tailored training procedures, while PaRO Collective Communications (PaRO-CC) speeds up collective communication operations by rearranging the topology. We also propose a guideline for choosing among DP strategies based on simple quantitative calculations, which yields minimal ranking errors. Our experiments demonstrate that PaRO, used as the basic DP strategy, improves the training speed of LLMs by up to 266% compared to ZeRO-3. Moreover, employing PaRO-CC independently with model parallel strategies, such as Megatron, can also boost the training speed by 17%.
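To illustrate the kind of "simple quantitative calculation" the abstract alludes to for choosing a DP strategy, the sketch below ranks candidate strategies by an estimated per-step communication time, weighting intra-node and inter-node transfers by different bandwidths. This is a minimal illustrative sketch only; the cost formulas, bandwidth figures, and strategy definitions are hypothetical placeholders, not the paper's actual guideline.

```python
# Toy cost model (hypothetical, not from the paper): rank data-parallel strategies
# by estimated per-step communication time, distinguishing intra-node transfers
# (fast interconnect) from inter-node transfers (slower network).

from dataclasses import dataclass


@dataclass
class ClusterSpec:
    intra_bw_gbps: float  # e.g. intra-node interconnect bandwidth
    inter_bw_gbps: float  # e.g. inter-node network bandwidth


@dataclass
class Strategy:
    name: str
    # Per-step communication volume, as multiples of the model size,
    # split by whether the transfer stays within a node or crosses nodes.
    intra_volume_factor: float
    inter_volume_factor: float


def step_comm_time(model_size_gb: float, s: Strategy, c: ClusterSpec) -> float:
    """Rough per-step communication time estimate in seconds (GB -> Gb via *8)."""
    intra = s.intra_volume_factor * model_size_gb * 8 / c.intra_bw_gbps
    inter = s.inter_volume_factor * model_size_gb * 8 / c.inter_bw_gbps
    return intra + inter


def rank_strategies(model_size_gb: float, strategies, cluster: ClusterSpec):
    """Return strategies sorted from cheapest to most expensive communication."""
    return sorted(strategies, key=lambda s: step_comm_time(model_size_gb, s, cluster))


if __name__ == "__main__":
    cluster = ClusterSpec(intra_bw_gbps=300.0, inter_bw_gbps=100.0)
    candidates = [
        Strategy("fully-sharded (ZeRO-3-like)", intra_volume_factor=1.0, inter_volume_factor=2.0),
        Strategy("partial redundancy (shard inter-node, replicate intra-node)", intra_volume_factor=2.0, inter_volume_factor=1.0),
    ]
    for s in rank_strategies(model_size_gb=14.0, strategies=candidates, cluster=cluster):
        print(f"{s.name}: ~{step_comm_time(14.0, s, cluster):.2f} s comm / step")
```

Under this toy model, a strategy that shifts communication volume from the slower inter-node links to the faster intra-node links wins the ranking, which is the intuition behind trading some memory redundancy for reduced cross-group traffic.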