

Keynote Talk
in
Workshop: Third Workshop on Efficient Natural Language and Speech Processing (ENLSP-III): Towards the Future of Large Language Models and their Emerging Descendants

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Luke Zettlemoyer

Sat 16 Dec 8:30 a.m. PST — 9 a.m. PST

Abstract:

Existing language model (LM) training regimes entangle compute, data, and parameters, requiring expensive synchronous communication with massive supercomputers. This talk introduces a new algorithm called Branch-Train-Merge (BTM) that asynchronously trains LMs that are fundamentally modular. In BTM, components (or experts) of the LM are specialized to distinct domains in the training corpus, and experts are conditionally updated based on the domain of the incoming document. We show how BTM enables LMs that are rapidly customizable (with the ability to mix, add, or remove experts after training), embarrassingly parallel (requiring no communication between experts), and sparse (needing only a few experts active at a time for inference). Key to our proposal is exploring what constitutes the domains to which experts specialize, as well as reflecting on the data sources used to train LMs. Our new techniques chart a path towards collaborative and iterative LM development, where anyone can contribute and maintain experts at modest computational cost.
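To make the described workflow concrete, the sketch below illustrates the three BTM stages (branch a copy of a seed model per domain, train each copy independently on its domain shard, and merge expert predictions sparsely at inference) using a toy unigram model in place of a real LM. All class and function names here are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch of the Branch-Train-Merge (BTM) workflow.
# A toy unigram "LM" stands in for a full expert language model.
from collections import Counter
from copy import deepcopy


class ToyLM:
    """A unigram language model standing in for an expert LM."""

    def __init__(self, counts=None):
        self.counts = Counter(counts or {})

    def train(self, documents):
        # Each expert is updated only on documents from its own domain,
        # so experts never communicate during training.
        for doc in documents:
            self.counts.update(doc.split())
        return self

    def score(self, token):
        total = sum(self.counts.values()) or 1
        return self.counts[token] / total


def branch_train(seed_lm, domain_shards):
    """Branch: copy the seed LM per domain; Train: update each copy independently."""
    return {domain: deepcopy(seed_lm).train(docs)
            for domain, docs in domain_shards.items()}


def merge_score(experts, token, domain_weights, top_k=2):
    """Merge: sparse ensemble that activates only the top-k most relevant experts."""
    top = sorted(domain_weights, key=domain_weights.get, reverse=True)[:top_k]
    z = sum(domain_weights[d] for d in top) or 1.0
    return sum(domain_weights[d] / z * experts[d].score(token) for d in top)


if __name__ == "__main__":
    seed = ToyLM().train(["the a of and"])  # shared seed LM before branching
    shards = {
        "news": ["the markets rallied today", "elections were held today"],
        "code": ["def train experts in parallel", "return merged experts"],
        "bio":  ["the protein binds the receptor"],
    }
    experts = branch_train(seed, shards)  # embarrassingly parallel step
    weights = {"news": 0.1, "code": 0.8, "bio": 0.1}  # e.g. from a domain posterior
    print(merge_score(experts, "experts", weights, top_k=2))
```

Because each expert is trained in isolation, experts can also be added, removed, or remixed after training simply by editing the expert dictionary and its weights, which is the customizability the abstract highlights.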
