


Abstract:

Knowledge distillation (KD) has been widely employed to transfer knowledge from a large language model (LLM) to a specialized model in low-data regimes through pseudo-label learning. However, pseudo labels generated by teacher models are usually noisy and can degrade KD performance. This study delves into KD with noisy teachers and uncovers that, partway through KD, the student model can already generate predictions more accurate than the teacher labels used to train it, indicating its inherent ability to denoise noisy teacher labels. Motivated by this finding, we propose Peer-Advised KD to improve vanilla KD from noisy teachers. Experiments show that Peer-Advised KD can outperform the LLM teacher by approximately 5% with only 50 human-labeled examples, and is even competitive with standard supervised fine-tuning using 750 human-labeled examples.
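For context, the sketch below illustrates the vanilla pseudo-label KD baseline that the abstract describes (an LLM teacher annotates unlabeled data and a small student is trained on those possibly noisy labels); it is not the proposed Peer-Advised KD method, whose details are not given in the abstract, and names such as `teacher_llm`, `student`, and `unlabeled_texts` are hypothetical placeholders.

```python
# Minimal sketch of vanilla pseudo-label KD from a (possibly noisy) LLM teacher.
# All model and data names are illustrative assumptions, not from the paper.
import torch
import torch.nn.functional as F

def pseudo_label_kd(teacher_llm, student, unlabeled_texts, optimizer, num_epochs=3):
    """Fine-tune a small student on class labels predicted by an LLM teacher."""
    # Step 1: the teacher annotates the unlabeled pool once; these labels may be noisy.
    with torch.no_grad():
        pseudo_labels = torch.stack(
            [teacher_llm(text).argmax(dim=-1) for text in unlabeled_texts]
        )

    # Step 2: standard supervised training of the student on the pseudo labels.
    for _ in range(num_epochs):
        for text, label in zip(unlabeled_texts, pseudo_labels):
            logits = student(text)  # class scores of shape (num_classes,)
            loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

The paper's finding is that, midway through this loop, the student's own predictions can already be more accurate than `pseudo_labels`, which motivates correcting the noisy teacher labels rather than training on them unchanged.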
