Toward Efficient Training of Large Language Models with Balanced Conditional Compute
Luke Zettlemoyer

The trend of building ever-larger language models has dominated much of the research in NLP over the last few years. However, we have reached a point where dense compute is difficult to scale further, and there is a need for new, more efficient model architectures. In this talk, I will cover our recent efforts on learning sparse mixture-of-experts (MoE) models, which have new, explicitly balanced control mechanisms for allocating conditional compute. This includes BASE Layers, where the assignment of tokens to experts is computed algorithmically to ensure balanced scaling across compute nodes, and DEMix Layers, where we introduce new modular approaches for deterministic expert routing based on metadata that specifies the domain of the input text. Overall, our sparse approaches have significantly reduced cross-node communication costs and could provide the next big leap in performance, although finding a version that scales well in practice remains an open challenge.

Author Information

Luke Zettlemoyer (University of Washington and Facebook)
