
Workshop: Instruction Tuning and Instruction Following

For Distillation, Tokens Are Not All You Need

Mrigank Raman · Pranav Mani · Davis Liang · Zachary Lipton

Keywords: [ Distillation ] [ LLaMa ] [ Large language models ] [ Instruction Tuning ]

Abstract: The unwieldy size of state-of-the-art language models presents significant obstacles for deployment, driving up cost and latency. While prior works have offered methods for distilling these larger language models into smaller students, the best previous method is somewhat complex, relying on an RL-based optimization. In this work, we introduce SLIM (Sparse Logit Infused Modeling), a simple method for distilling LLMs that leverages not only samples from the teacher LLM but also the values of the logits produced at each decoding step. Our distillation method uses only the top 5% of logits by value, along with a dynamic weighting scheme that assigns weights to the KL divergence and cross-entropy losses based on the relative confidence of the student and teacher models. Our experiments demonstrate that SLIM produces models that perform better on a wide range of downstream NLP tasks than supervised fine-tuning, vanilla knowledge distillation, and the recently proposed MiniLLM. Unlike other methods, ours scales to much larger teachers ($\sim$70B parameters). We also provide an intuition for the superior performance of SLIM via established sample complexity bounds in simplified scenarios.
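As a rough illustration of the two ingredients named in the abstract (sparse top-5% teacher logits and a confidence-based mix of KL divergence and cross-entropy), the following is a minimal sketch for a single decoding step. It is not the authors' implementation: the function names, the renormalization of the teacher over its kept logits, and in particular the weighting formula are assumptions made for illustration, since the abstract does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def top_fraction_indices(logits, fraction=0.05):
    """Indices of the highest logits, keeping `fraction` of the vocabulary."""
    k = max(1, math.ceil(fraction * len(logits)))
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def sparse_distillation_loss(student_logits, teacher_logits, target, fraction=0.05):
    """Hypothetical per-token loss mixing sparse KL and cross-entropy.

    Returns (loss, kl, ce, weight). The weight formula below is an
    assumption, not the scheme used in the paper.
    """
    keep = top_fraction_indices(teacher_logits, fraction)

    # Teacher distribution renormalized over its kept (sparse) logits.
    t_logp = log_softmax([teacher_logits[i] for i in keep])
    # Student distribution over the full vocabulary.
    s_logp = log_softmax(student_logits)

    # KL(sparse teacher || student), summed over the kept entries only.
    kl = sum(math.exp(lt) * (lt - s_logp[i]) for lt, i in zip(t_logp, keep))

    # Cross-entropy of the student on the sampled target token.
    ce = -s_logp[target]

    # Assumed weighting: lean on the teacher's distribution (KL) more when
    # the teacher is more confident than the student on the target token.
    t_conf = math.exp(log_softmax(teacher_logits)[target])
    s_conf = math.exp(s_logp[target])
    weight = t_conf / (t_conf + s_conf)

    loss = weight * kl + (1.0 - weight) * ce
    return loss, kl, ce, weight
```

In a real training loop the same computation would be batched over positions with a tensor library, and only the kept teacher logits would need to be stored alongside the teacher's samples.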
