
Workshop: Instruction Tuning and Instruction Following

A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts

Oleksiy Ostapenko · Lucas Page-Caccia · Zhan Su · Nicolas Le Roux · Laurent Charlin · Alessandro Sordoni

Keywords: [ parameter efficient fine-tuning ] [ Instruction Tuning ] [ mixture of experts ]


We study the applicability of mixtures of parameter-efficient experts (MoPEs) for instruction-tuning large decoder-only language models. Recent literature indicates that MoPEs might enhance performance on specific multi-task instruction-following datasets. In this paper, we extend these results and study the applicability of MoPEs in previously overlooked settings: a) with open-domain instruction-following datasets; b) with recent decoder-only models; and c) with downstream out-of-distribution test sets. We build on top of LLaMA1-13B/-7B and LLaMA2-13B. We study different variants of learned routing, namely per-example routing ([PE]) and the more expensive per-token routing ([PT]). Overall, we are unable to substantiate in our setting the strong performance gains observed in related studies. We observe occasional improvements for LLaMA2 fine-tuned on the Open Platypus dataset in 0-shot SNI evaluation, and in TruthfulQA evaluation after fine-tuning on a subset of Flan. We shed some light on the inner workings of MoPEs by comparing different routing strategies. We find that [PE] routing tends to collapse at downstream evaluation time, which reduces the importance of applying the router. We plan to publicly release our code.
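
To make the difference between [PE] and [PT] routing concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that each expert is a LoRA-style low-rank pair (A, B), that the router is a linear layer followed by a softmax, and that per-example routing feeds the router the mean of the token representations. All function names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vecmat(x, M):
    # Row vector x (length n) times matrix M (n rows, m columns).
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def expert_delta(x, expert):
    # Assumed LoRA-style expert: low-rank update x @ A @ B.
    A, B = expert
    return vecmat(vecmat(x, A), B)

def route(x, router):
    # Assumed linear router: one weight vector per expert, then softmax.
    return softmax([dot(x, w) for w in router])

def mix(x, experts, weights):
    # Weighted sum of the expert updates for a single token.
    deltas = [expert_delta(x, e) for e in experts]
    return [sum(w * d[j] for w, d in zip(weights, deltas)) for j in range(len(x))]

def per_token_routing(tokens, experts, router):
    # [PT]: the router is evaluated once per token.
    return [mix(x, experts, route(x, router)) for x in tokens]

def per_example_routing(tokens, experts, router):
    # [PE]: the router is evaluated once per example (here on the mean token,
    # an assumption), and the same mixing weights are reused for every token.
    n = len(tokens)
    mean = [sum(x[j] for x in tokens) / n for j in range(len(tokens[0]))]
    weights = route(mean, router)
    return [mix(x, experts, weights) for x in tokens]
```

In this sketch the returned vectors are the adapter updates only; in a real model they would be added to the output of the frozen base layer. Per-token routing calls the router once per token, while per-example routing calls it once per sequence, which is the cost difference the abstract alludes to.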
