Poster
S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity
Xinyu Yang · Jixuan Leng · Geyang Guo · Jiawei Zhao · Ryumei Nakada · Linjun Zhang · Huaxiu Yao · Beidi Chen
West Ballroom A-D #7101
Fri 13 Dec 11 a.m. PST — 2 p.m. PST
Abstract:
Current PEFT methods for LLMs can achieve high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Building on this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". It selects only a few heads in the MHA module and a few channels in the FFN module of each Transformer block. Next, it co-permutes the weight matrices on both sides of the coupled structures in LLMs to connect the selected components, yielding multiple compact, dense, and trainable weight submatrices. Finally, S$^{2}$FT updates these submatrices to perform PEFT. Through theoretical analysis and empirical results, we show that our method prevents overfitting and forgetting, delivers SOTA performance on established benchmarks with improvements of up to 4.1%, and outperforms full FT by 7.1% on generalization tasks. By integrating our partial back-propagation algorithm, S$^{2}$FT reduces fine-tuning memory by up to 3$\times$ and improves throughput by 1.5-2.7$\times$ compared to full FT. We further show that S$^{2}$FT can be written in a low-rank format, enabling scalable batched serving of multiple fine-tuned models.
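To make the "select sparsely, compute densely" idea concrete, below is a minimal sketch (not the authors' code) applied to a single FFN block in PyTorch. The dimensions, the random channel selection, and the gradient-masking hooks are illustrative assumptions; the paper's partial back-propagation avoids computing the masked gradients altogether rather than zeroing them afterwards.

```python
# Sketch of structured sparse fine-tuning on one FFN block:
# select a few intermediate channels, co-permute the coupled weight matrices so the
# selected channels form a contiguous dense block, and train only that block.
import torch
import torch.nn as nn

d_model, d_ffn, n_selected = 64, 256, 16  # hypothetical sizes

# Coupled FFN weights: up-projection (d_model -> d_ffn) and down-projection (d_ffn -> d_model).
W_up = nn.Linear(d_model, d_ffn, bias=False)
W_down = nn.Linear(d_ffn, d_model, bias=False)

# 1) Sparse selection: pick a small set of intermediate channels to fine-tune.
selected = torch.randperm(d_ffn)[:n_selected]
rest = torch.tensor([i for i in range(d_ffn) if i not in set(selected.tolist())])
perm = torch.cat([selected, rest])  # co-permutation of the shared channel dimension

# 2) Co-permute both weight matrices along the coupled dimension so the selected
#    channels become one contiguous block. The FFN function is unchanged because the
#    same permutation is applied to W_up's output rows and W_down's input columns.
with torch.no_grad():
    W_up.weight.copy_(W_up.weight[perm, :])      # rows of up-projection
    W_down.weight.copy_(W_down.weight[:, perm])  # columns of down-projection

# 3) Dense computation on the selected block only: emulate partial back-propagation
#    by zeroing gradients outside the leading block.
def keep_block_rows(grad):
    mask = torch.zeros_like(grad)
    mask[:n_selected, :] = 1.0
    return grad * mask

def keep_block_cols(grad):
    mask = torch.zeros_like(grad)
    mask[:, :n_selected] = 1.0
    return grad * mask

W_up.weight.register_hook(keep_block_rows)
W_down.weight.register_hook(keep_block_cols)

# One illustrative fine-tuning step.
opt = torch.optim.SGD(list(W_up.parameters()) + list(W_down.parameters()), lr=1e-2)
x = torch.randn(8, d_model)
loss = W_down(torch.relu(W_up(x))).pow(2).mean()
loss.backward()
opt.step()
```

Because the updates touch only a compact submatrix, the weight delta has rank at most the number of selected channels, which is what allows the fine-tuned model to be expressed in a low-rank format for batched serving.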