Poster
Sub-Linear Memory: How to Make Performers SLiM
Valerii Likhosherstov · Krzysztof Choromanski · Jared Quincy Davis · Xingyou Song · Adrian Weller
Transformer architectures have become very popular, yet the original implementation requires $O(L^2)$ serial time and memory as a function of input length $L$. Recent works have proposed various linear self-attention mechanisms that scale only as $O(L)$ for serial computation. We conduct a thorough complexity analysis of Performers, a class which includes most recent linear Transformer mechanisms. We note a remarkable computational flexibility: the gradient computation can be performed with no approximations using sublinear memory as a function of $L$ (in addition to negligible storage for the input sequence), at the cost of greater time complexity in the parallel setting. In the extreme case, a Performer consumes only $O(1)$ memory while still requiring $O(L)$ time. Due to complete backward compatibility, this time-memory tradeoff can be used for fine-tuning on low-memory devices in a decentralized fashion without any server computations.
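The constant-memory claim rests on the fact that causal linear attention can be evaluated with running prefix sums whose size does not depend on $L$. The following is a minimal, illustrative sketch of that forward-pass property only; it does not implement the gradient computation analyzed in the paper. The function names and the `elu(x) + 1` feature map are assumptions made for illustration, whereas Performers use random-feature maps.

```python
import math
import random


def feature_map(x):
    # Hypothetical positive feature map, elu(x) + 1 (an assumption for this sketch).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention_full(Q, K, V):
    """Reference: for each position t, sum explicitly over all s <= t (O(L^2) time)."""
    out = []
    for t in range(len(Q)):
        q = feature_map(Q[t])
        num = [0.0] * len(V[0])
        den = 0.0
        for s in range(t + 1):
            k = feature_map(K[s])
            w = sum(a * b for a, b in zip(q, k))  # unnormalized attention weight
            den += w
            for j in range(len(num)):
                num[j] += w * V[s][j]
        out.append([n / den for n in num])
    return out


def causal_linear_attention_streaming(Q, K, V):
    """Same outputs via running prefix sums; the state (S, z) has size independent of L."""
    m = len(feature_map(Q[0]))
    d = len(V[0])
    S = [[0.0] * d for _ in range(m)]  # running sum of phi(k_s) v_s^T
    z = [0.0] * m                      # running sum of phi(k_s)
    out = []  # outputs are collected here only so they can be compared
    for q_raw, k_raw, v in zip(Q, K, V):
        q, k = feature_map(q_raw), feature_map(k_raw)
        for i in range(m):
            z[i] += k[i]
            for j in range(d):
                S[i][j] += k[i] * v[j]
        den = sum(q[i] * z[i] for i in range(m))
        out.append([sum(q[i] * S[i][j] for i in range(m)) / den for j in range(d)])
    return out


def random_sequence(length, dim, rng):
    # Helper for the usage example: a list of `length` Gaussian vectors.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
Q, K, V = (random_sequence(6, 3, rng) for _ in range(3))
full = causal_linear_attention_full(Q, K, V)
streaming = causal_linear_attention_streaming(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(full, streaming) for a, b in zip(ra, rb))
```

In this sketch `max_diff` should be at floating-point rounding level, since both functions compute the same sums in a different order. The streaming version touches each position once and keeps only `S` and `z` between steps, which is the structural property that makes a time-memory tradeoff possible.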
Author Information
Valerii Likhosherstov (University of Cambridge)
Krzysztof Choromanski (Google Brain Robotics & Columbia University)
Jared Quincy Davis (DeepMind | Stanford)
Xingyou Song (Google Brain)
Adrian Weller (University of Cambridge)
More from the Same Authors
- 2021 Spotlight: Iterative Teaching by Label Synthesis
  Weiyang Liu · Zhen Liu · Hanchen Wang · Liam Paull · Bernhard Schölkopf · Adrian Weller
- 2022 : Controlling Commercial Cooling Systems Using Reinforcement Learning
  Jerry Luo · Cosmin Paduraru · Octavian Voicu · Yuri Chervonyi · Scott Munns · Jerry Li · Crystal Qian · Praneet Dutta · Daniel Mankowitz · Jared Quincy Davis · Ningjia Wu · Xingwei Yang · Chu-Ming Chang · Ted Li · Rob Rose · Mingyan Fan · Hootan Nakhost · Tinglin Liu · Deeni Fatiha · Neil Satra · Juliet Rothenberg · Molly Carlin · Satish Tallapaka · Sims Witherspoon · David Parish · Peter Dolan · Chenyu Zhao
- 2022 Poster: Chefs' Random Tables: Non-Trigonometric Random Features
  Valerii Likhosherstov · Krzysztof M Choromanski · Kumar Avinava Dubey · Frederick Liu · Tamas Sarlos · Adrian Weller
- 2021 Poster: Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch
  Luca Viano · Yu-Ting Huang · Parameswaran Kamalaruban · Adrian Weller · Volkan Cevher
- 2021 Poster: Iterative Teaching by Label Synthesis
  Weiyang Liu · Zhen Liu · Hanchen Wang · Liam Paull · Bernhard Schölkopf · Adrian Weller
- 2020 Poster: Ode to an ODE
  Krzysztof Choromanski · Jared Quincy Davis · Valerii Likhosherstov · Xingyou Song · Jean-Jacques Slotine · Jacob Varley · Honglak Lee · Adrian Weller · Vikas Sindhwani
- 2018 Poster: Geometrically Coupled Monte Carlo Sampling
  Mark Rowland · Krzysztof Choromanski · François Chalus · Aldo Pacchiano · Tamas Sarlos · Richard Turner · Adrian Weller
- 2018 Spotlight: Geometrically Coupled Monte Carlo Sampling
  Mark Rowland · Krzysztof Choromanski · François Chalus · Aldo Pacchiano · Tamas Sarlos · Richard Turner · Adrian Weller
- 2017 Poster: On Blackbox Backpropagation and Jacobian Sensing
  Krzysztof Choromanski · Vikas Sindhwani
- 2017 Poster: The Unreasonable Effectiveness of Structured Random Orthogonal Embeddings
  Krzysztof Choromanski · Mark Rowland · Adrian Weller