Poster
Demystify Mamba in Vision: A Linear Attention Perspective
Dongchen Han · Ziyi Wang · Zhuofan Xia · Yizeng Han · Yifan Pu · Chunjiang Ge · Jun Song · Shiji Song · Bo Zheng · Gao Huang
East Exhibit Hall A-C #2005
Mamba, an effective state space model with linear computational complexity, is seen as a promising approach for handling high-resolution images in various vision tasks. Nevertheless, the powerful Mamba model shares surprising similarities with the linear attention Transformer, which typically delivers poor performance and is suboptimal in practice. In this paper, we explore the similarities and disparities between Mamba and linear attention Transformer, providing comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we begin with the formulas and rephrase Mamba as a variant of linear attention Transformer with six distinctions: input gate, forget gate, shortcut, no attention normalization, single-head design, and modified block design. For each design, we meticulously analyze its pros and cons and evaluate its suitability for vision tasks. Moreover, empirical studies are conducted to assess the impact of each design, highlighting the forget gate and block design as the core contributors to Mamba's success. Based on our findings, we propose our Mamba-Like Linear Attention (MLLA) model by incorporating the merits of these two key designs into linear attention. We empirically observe that MLLA outperforms various vision Mamba models on both image classification and high-resolution dense prediction tasks, while preserving parallelizable computation and fast inference speed. Code will be released.
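To make the abstract's framing concrete, the sketch below contrasts vanilla linear attention in its recurrent form with a Mamba-like variant that adds an input gate and a forget gate and drops attention normalization. This is a minimal, hedged illustration of the general idea only, not the authors' exact formulation; all function and gate names here are hypothetical.

```python
import torch

def linear_attention_recurrent(q, k, v):
    """Vanilla linear attention in recurrent form.
    q, k, v: (T, d) tensors; q and k are assumed to already have a
    non-negative kernel feature map applied. Illustrative only."""
    T, d = q.shape
    S = torch.zeros(d, d)   # running key-value memory
    z = torch.zeros(d)      # running key sum for normalization
    out = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])        # accumulate k v^T
        z = z + k[t]
        y = (q[t] @ S) / (q[t] @ z + 1e-6)     # normalized readout
        out.append(y)
    return torch.stack(out)

def gated_linear_attention_recurrent(q, k, v, forget, inp):
    """A Mamba-like variant (hypothetical sketch): an input-dependent
    forget gate decays the memory, an input gate modulates the value,
    and the attention normalization term is dropped."""
    T, d = q.shape
    S = torch.zeros(d, d)
    out = []
    for t in range(T):
        # forget gate decays the memory row-wise; input gate scales the value
        S = forget[t].unsqueeze(-1) * S + torch.outer(k[t], inp[t] * v[t])
        out.append(q[t] @ S)                   # no attention normalization
    return torch.stack(out)

# toy usage
T, d = 8, 16
q = torch.relu(torch.randn(T, d)); k = torch.relu(torch.randn(T, d))
v = torch.randn(T, d)
forget = torch.sigmoid(torch.randn(T, d)); inp = torch.sigmoid(torch.randn(T, d))
print(linear_attention_recurrent(q, k, v).shape)          # torch.Size([8, 16])
print(gated_linear_attention_recurrent(q, k, v, forget, inp).shape)
```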