

Poster in Workshop on Machine Learning and Compression

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Makoto Shing · Kou Misaki · Han Bao · Sho Yokoi · Takuya Akiba


Abstract: Large language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation (KD) offers a promising approach for model compression, yet existing methods struggle with mode averaging, mode collapse, and the substantial capacity gap between teacher and student models. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID), a novel KD approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. This approach provides smooth knowledge transfer, effectively addressing the capacity gap while balancing mode-averaging and mode-collapse tendencies. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B for language tasks and TAID-VLM-2B for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
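
To make the interpolation idea in the abstract concrete, the sketch below shows one plausible way to form the intermediate target distribution and the distillation loss in PyTorch. This is a minimal illustration, not the authors' reference implementation: the function name `taid_loss`, the use of a detached student distribution, the forward-KL objective, and the linear schedule for the interpolation weight `t` are all assumptions made here; the paper's adaptive update of the interpolation parameter is not reproduced.

```python
import torch
import torch.nn.functional as F

def taid_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, t: float) -> torch.Tensor:
    """Illustrative TAID-style loss (a sketch under stated assumptions).

    Mixes the (detached) student distribution with the teacher distribution
    at weight t in [0, 1], then trains the student to match this intermediate
    target with a forward KL divergence.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    with torch.no_grad():
        student_probs = F.softmax(student_logits, dim=-1)   # detached student (assumption)
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        # Intermediate distribution: equals the student at t=0 and the teacher at t=1.
        intermediate = (1.0 - t) * student_probs + t * teacher_probs
    # Forward KL(intermediate || student), summed over the vocabulary dimension.
    kl = (intermediate * (intermediate.clamp_min(1e-12).log() - student_log_probs)).sum(dim=-1)
    return kl.mean()

# Example training-loop usage with a simple linear schedule for t (an assumption;
# the paper adapts this parameter during training rather than fixing a schedule).
# for step in range(total_steps):
#     t = step / total_steps
#     loss = taid_loss(student(batch), teacher(batch), t)
```

The key design point illustrated here is that the target shifts over training: early on the student is asked to match a distribution close to its own, and only gradually does the target approach the teacher, which is how the abstract describes TAID's handling of the capacity gap.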
