

Poster

Sub-Linear Memory: How to Make Performers SLiM

Valerii Likhosherstov · Krzysztof Choromanski · Jared Quincy Davis · Xingyou Song · Adrian Weller

Keywords: [ Transformers ]


Abstract: Transformer architectures have become very popular, yet the original implementation requires $O(L^2)$ serial time and memory as functions of the input length $L$. Recent works have proposed various linear self-attention mechanisms that scale only as $O(L)$ for serial computation. We conduct a thorough complexity analysis of Performers, a class that includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory while still requiring $O(L)$ time. Thanks to complete backward compatibility, this time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion, without any server computations.
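The sketch below is not the authors' code; it is a minimal illustration, under assumed shapes and a chunk count `num_chunks`, of the kind of time-memory tradeoff the abstract describes for causal Performer-style linear attention: the sequence is processed chunk by chunk with a carried prefix-sum state, and per-chunk activations are rematerialized in the backward pass rather than stored.

```python
# Minimal sketch (assumptions: random-feature maps q, k of width M are already
# computed and non-negative; L is divisible by num_chunks).
import jax
import jax.numpy as jnp

def chunked_causal_linear_attention(q, k, v, num_chunks):
    """q, k: (L, M) feature maps; v: (L, D) values. Returns (L, D) outputs."""
    L, M = q.shape
    D = v.shape[-1]
    qs = q.reshape(num_chunks, L // num_chunks, M)
    ks = k.reshape(num_chunks, L // num_chunks, M)
    vs = v.reshape(num_chunks, L // num_chunks, D)

    @jax.checkpoint  # drop this chunk's intermediates; recompute them on backward
    def chunk_step(carry, inputs):
        S, z = carry            # S: (M, D) prefix sum of k^T v; z: (M,) prefix sum of k
        qc, kc, vc = inputs     # one chunk of queries/keys/values
        # Causal prefix sums within the chunk.
        S_local = jnp.cumsum(kc[:, :, None] * vc[:, None, :], axis=0)  # (l, M, D)
        z_local = jnp.cumsum(kc, axis=0)                               # (l, M)
        num = jnp.einsum('lm,lmd->ld', qc, S[None] + S_local)
        den = jnp.einsum('lm,lm->l', qc, z[None] + z_local) + 1e-6
        out = num / den[:, None]
        return (S + S_local[-1], z + z_local[-1]), out

    init = (jnp.zeros((M, D)), jnp.zeros((M,)))
    _, outs = jax.lax.scan(chunk_step, init, (qs, ks, vs))
    return outs.reshape(L, D)
```

In this sketch, peak activation memory scales with the chunk length $L/C$ plus the $O(MD)$ carried state rather than with $L$; shrinking the chunks toward single positions approaches the constant-memory extreme mentioned in the abstract, at the price of a longer serial (less parallel) computation.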
