StructMoE: Structured Mixture of Experts Using Low Rank Experts
Zain Sarwar ⋅ Ashwinee Panda ⋅ Benjamin Thérien ⋅ Stephen Rawls ⋅ Anirban Das ⋅ Kartik Balasubramaniam ⋅ Berkcan Kapusuzoglu ⋅ Shixiong Zhang ⋅ Sambit Sahu ⋅ Milind Naphade ⋅ Supriyo Chakraborty
Keywords:
Efficient Architectures
Abstract
We introduce StructMoE, a method for scaling MoE architectures by augmenting experts with dynamic capacity via structured matrices we call Low Rank Experts (LoREs). LoREs are selected on a per-expert, per-token basis by a secondary router specific to each expert, and their outputs are entangled with the main expert's up-projection before the activation function. Empirically, we find that this approach outperforms an MoE baseline in terms of loss on a held-out validation set.
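The abstract sketches the expert mechanism but does not give an exact formulation. Below is a minimal PyTorch sketch of one possible reading: each expert carries its own secondary router that selects one LoRE per token, and the LoRE's low-rank output is added into the expert's up-projection before the activation. The class and parameter names (StructMoEExpert, lore_router), the dimensions, the top-1 selection, and the additive form of the "entanglement" are all assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructMoEExpert(nn.Module):
    """One expert with per-token low-rank experts (LoREs). Hypothetical sketch."""

    def __init__(self, d_model=512, d_ff=2048, n_lores=4, rank=8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)     # main expert up-projection
        self.down = nn.Linear(d_ff, d_model)   # main expert down-projection
        # Secondary router, specific to this expert, scoring LoREs per token.
        self.lore_router = nn.Linear(d_model, n_lores)
        # Each LoRE is a low-rank factor pair: A (d_model x rank), B (rank x d_ff).
        self.A = nn.Parameter(torch.randn(n_lores, d_model, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(n_lores, rank, d_ff) * 0.02)

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.lore_router(x)            # (tokens, n_lores)
        idx = scores.argmax(dim=-1)             # top-1 LoRE per token (assumed)
        gate = F.softmax(scores, dim=-1).gather(-1, idx[:, None])  # (tokens, 1)
        # Low-rank path per token: x @ A[idx] @ B[idx].
        low = torch.einsum('td,tdr->tr', x, self.A[idx])
        low = torch.einsum('tr,trf->tf', low, self.B[idx])
        # Entangle the LoRE output with the up-projection before the activation.
        h = F.gelu(self.up(x) + gate * low)
        return self.down(h)
```

In a full StructMoE layer, a standard top-k MoE router would first dispatch tokens to experts, and each dispatched expert would apply this forward pass; the LoRE routing happens inside the expert, independently of the primary routing decision.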