Are Sixteen Heads Really Better than One?
Paul Michel · Omer Levy · Graham Neubig

Wed Dec 11 05:00 PM -- 07:00 PM (PST) @ East Exhibition Hall B + C #126

Multi-headed attention is a driving force behind recent state-of-the-art NLP models. By applying multiple attention mechanisms in parallel, it can express sophisticated functions beyond a simple weighted average. However, we observe that, in practice, a large proportion of attention heads can be removed at test time without significantly impacting performance, and that some layers can even be reduced to a single head. Further analysis of machine translation models reveals that the self-attention layers can be significantly pruned, while the encoder-decoder attention layers are more dependent on multi-headedness.
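The test-time pruning described above can be illustrated with a minimal sketch: each head computes its own scaled dot-product attention, and a binary per-head mask zeroes out pruned heads before the outputs are concatenated. This is an illustrative NumPy toy, not the authors' implementation; all variable names and shapes here are assumptions for the example.

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, head_mask):
    """Multi-head self-attention with a per-head binary mask.

    x:         (seq_len, d_model) input sequence
    Wq/Wk/Wv:  (n_heads, d_model, d_head) per-head projection weights
    Wo:        (n_heads * d_head, d_model) output projection
    head_mask: (n_heads,) with 1.0 to keep a head, 0.0 to prune it
    """
    outputs = []
    for h in range(Wq.shape[0]):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(q.shape[-1])
        # softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # a pruned head contributes exactly zero to the concatenation
        outputs.append(head_mask[h] * (weights @ v))
    return np.concatenate(outputs, axis=-1) @ Wo

# Toy example: a 4-head layer reduced to a single head at test time
rng = np.random.default_rng(0)
n_heads, d_model, d_head, seq_len = 4, 8, 2, 5
x  = rng.standard_normal((seq_len, d_model))
Wq = rng.standard_normal((n_heads, d_model, d_head))
Wk = rng.standard_normal((n_heads, d_model, d_head))
Wv = rng.standard_normal((n_heads, d_model, d_head))
Wo = rng.standard_normal((n_heads * d_head, d_model))

full   = multi_head_attention(x, Wq, Wk, Wv, Wo, np.ones(n_heads))
pruned = multi_head_attention(x, Wq, Wk, Wv, Wo, np.array([1.0, 0.0, 0.0, 0.0]))
print(full.shape, pruned.shape)  # both (5, 8)
```

The mask makes pruning differentiable-friendly and cheap to ablate: setting an entry to zero removes that head's entire contribution without changing any tensor shapes, which is what allows heads to be dropped at test time without retraining.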

Author Information

Paul Michel (Carnegie Mellon University, Language Technologies Institute)
Omer Levy (Facebook AI Research)
Graham Neubig (Carnegie Mellon University)

More from the Same Authors