Poster
Rethinking Memory and Communication Costs for Efficient Data Parallel Training of Large Language Models
Hanxiao Zhang · Lin JU · Chan Wu · Jinjing Huang · Youshao Xiao · Zhenglei Zhou · Zhiming fan · Zhaoxin Huan · Siyuan Li · Fanzhuang Meng · Lei Liang · Xiaolu Zhang · Jun Zhou
West Ballroom A-D #6108
Recently, various strategies for distributed training of large language models (LLMs) have been proposed. By categorizing them into basic strategies and composite strategies, we have discovered that existing basic strategies provide limited options in specific scenarios, leaving considerable room for optimization in training speed. In this paper, we rethink the impact of memory and communication costs on the training speed of LLMs, accounting for the disparity between intra- and inter-group communication performance, and then propose a new set of basic strategies named the Partial Redundancy Optimizer (PaRO). PaRO Data Parallelism (PaRO-DP) accelerates LLM training through refined model state partitioning and tailored training procedures, while PaRO Collective Communications (PaRO-CC) speeds up collective communication operations by rearranging the topology. We also propose a guideline for choosing among DP strategies based on simple quantitative calculations, which yields minimal ranking errors. Our experiments demonstrate that PaRO, used as the basic DP strategy, improves the training speed of LLMs by up to 266% compared to ZeRO-3. Moreover, employing PaRO-CC independently with model parallel strategies, such as Megatron, can also boost the training speed by 17%.
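To illustrate the kind of "simple quantitative calculation" the abstract alludes to for choosing a DP strategy, the sketch below ranks candidate strategies by an estimated per-step communication time, weighting intra-node and inter-node transfers by different bandwidths. This is a minimal illustrative sketch only; the cost formulas, bandwidth figures, and strategy definitions are hypothetical placeholders, not the paper's actual guideline.

```python
# Toy cost model (hypothetical, not from the paper): rank data-parallel strategies
# by estimated per-step communication time, distinguishing intra-node transfers
# (fast interconnect) from inter-node transfers (slower network).

from dataclasses import dataclass


@dataclass
class ClusterSpec:
    intra_bw_gbps: float  # e.g. intra-node interconnect bandwidth
    inter_bw_gbps: float  # e.g. inter-node network bandwidth


@dataclass
class Strategy:
    name: str
    # Per-step communication volume, as multiples of the model size,
    # split by whether the transfer stays within a node or crosses nodes.
    intra_volume_factor: float
    inter_volume_factor: float


def step_comm_time(model_size_gb: float, s: Strategy, c: ClusterSpec) -> float:
    """Rough per-step communication time estimate in seconds (GB -> Gb via *8)."""
    intra = s.intra_volume_factor * model_size_gb * 8 / c.intra_bw_gbps
    inter = s.inter_volume_factor * model_size_gb * 8 / c.inter_bw_gbps
    return intra + inter


def rank_strategies(model_size_gb: float, strategies, cluster: ClusterSpec):
    """Return strategies sorted from cheapest to most expensive communication."""
    return sorted(strategies, key=lambda s: step_comm_time(model_size_gb, s, cluster))


if __name__ == "__main__":
    cluster = ClusterSpec(intra_bw_gbps=300.0, inter_bw_gbps=100.0)
    candidates = [
        Strategy("fully-sharded (ZeRO-3-like)", intra_volume_factor=1.0, inter_volume_factor=2.0),
        Strategy("partial redundancy (shard inter-node, replicate intra-node)", intra_volume_factor=2.0, inter_volume_factor=1.0),
    ]
    for s in rank_strategies(model_size_gb=14.0, strategies=candidates, cluster=cluster):
        print(f"{s.name}: ~{step_comm_time(14.0, s, cluster):.2f} s comm / step")
```

Under this toy model, a strategy that shifts communication volume from the slower inter-node links to the faster intra-node links wins the ranking, which is the intuition behind trading some memory redundancy for reduced cross-group traffic.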