
Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
Yanjing Li · Sheng Xu · Baochang Zhang · Xianbin Cao · Peng Gao · Guodong Guo

Tue Nov 29 09:00 AM -- 11:00 AM (PST) @ Hall J #628

Large pre-trained vision transformers (ViTs) have demonstrated remarkable performance on various visual tasks, but incur high computational and memory costs when deployed on resource-constrained devices. Among the powerful compression approaches, quantization drastically reduces computation and memory consumption through low-bit parameters and bit-wise operations. However, low-bit ViTs remain largely unexplored and usually suffer a significant performance drop compared with their real-valued counterparts. In this work, through extensive empirical analysis, we first identify that the bottleneck behind the severe performance drop is the information distortion of the low-bit quantized self-attention map. We then develop an information rectification module (IRM) and a distribution guided distillation (DGD) scheme for fully quantized vision transformers (Q-ViT) to effectively eliminate such distortion. We evaluate our methods on the popular DeiT and Swin backbones. Extensive experimental results show that our method substantially outperforms prior arts. For example, Q-ViT can theoretically accelerate ViT-S by 6.14x while achieving about 80.9% Top-1 accuracy, even surpassing the full-precision counterpart by 1.0% on the ImageNet dataset. Our code and models are available at https://github.com/YanjingLi0202/Q-ViT
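The distortion the abstract refers to can be seen with a minimal sketch: applying generic uniform low-bit quantization (not the paper's IRM/DGD method) to a softmax attention map collapses many distinct attention weights onto a handful of levels. The quantizer and the toy attention map below are illustrative assumptions, not code from the Q-ViT repository.

```python
import numpy as np

def uniform_quantize(x, bits=2):
    """Generic uniform quantizer: map x onto 2**bits evenly spaced
    levels between its min and max, then de-quantize back to floats."""
    levels = 2 ** bits - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)   # integer levels 0..levels
    return q * scale + lo            # de-quantized values

# Toy single-head self-attention map: softmax over random scores.
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

attn_q = uniform_quantize(attn, bits=2)
# With only 4 representable levels, the 16 distinct attention weights
# collapse to at most 4 values -- the information distortion that
# motivates the paper's rectification module.
print(np.unique(attn).size, "->", np.unique(attn_q).size, "distinct values")
```

Because softmax outputs are concentrated near small values, most of the low-bit levels go unused, which is why naive quantization of the attention map is particularly lossy.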

Author Information

Yanjing Li (Beihang University)
Sheng Xu (Beihang University)
Baochang Zhang (Beihang University)
Xianbin Cao (Beihang University)
Peng Gao (Shanghai AI Lab)
Guodong Guo (West Virginia University)
