
Workshop: Mathematics of Modern Machine Learning (M3L)

Attention-Only Transformers and Implementing MLPs with Attention Heads

Robert Huben · Valerie Morris


The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1 so long as the MLP's activation function comes from a restricted class including SiLU and close approximations of ReLU and GeLU. This allows one to convert an MLP-and-attention transformer into an attention-only transformer at the cost of greatly increasing the number of attention heads.
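The intuition for why SiLU fits this setting can be illustrated with a toy calculation. The sketch below is not the construction from the paper; it assumes a hypothetical head whose masked softmax sees only two positions: a "null" position with score 0 and value 0, and the current position whose score and value both equal the pre-activation x. Under those assumptions the softmax weight on the current position is sigmoid(x), so the head outputs x · sigmoid(x) = SiLU(x). The names `toy_attention_head` and `silu` are illustrative only.

```python
import math


def silu(x: float) -> float:
    """Reference SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def toy_attention_head(x: float) -> float:
    """Hypothetical 1-dimensional head attending over two positions.

    Position 0 is an assumed "null" position (score 0, value 0).
    Position 1 is the current token (score x, value x).
    """
    scores = [0.0, x]
    # Numerically stable softmax over the two scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    values = [0.0, x]
    # Attention output: softmax-weighted sum of the values.
    return sum(w * v for w, v in zip(weights, values))


# Compare the toy head against SiLU on a small grid of inputs.
xs = [-5.0 + i for i in range(11)]
max_err = max(abs(toy_attention_head(x) - silu(x)) for x in xs)
```

If the score were scaled to kx while the value stayed x, the same toy head would output x · sigmoid(kx), which tends toward ReLU as k grows; this may be related to the "close approximations" mentioned in the abstract, though the abstract itself does not specify the form.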
