

Poster

A Closer Look at the CLS Token for Cross-Domain Few-Shot Learning

Yixiong Zou · Shuai Yi · Yuhua Li · Ruixuan Li

East Exhibit Hall A-C #3509
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

The Vision Transformer (ViT) has shown great power in learning from large-scale datasets. However, collecting sufficient data in domains that require expert knowledge is difficult. To handle this problem, Cross-Domain Few-Shot Learning (CDFSL) has been proposed to transfer source-domain knowledge learned from sufficient data to target domains where only scarce data is available. In this paper, we find an intriguing phenomenon, neglected by previous works, for the ViT-based CDFSL task: leaving the CLS token at its random initialization, instead of loading its source-domain trained parameters, consistently improves target-domain performance. We then delve into this phenomenon to interpret it. We find that, owing to the inherent structure of the ViT, the CLS token naturally absorbs domain information, which is represented as the low-frequency component in the Fourier frequency space of images. Based on this phenomenon and interpretation, we further propose a method for the CDFSL task that decouples the domain information in the CLS token during source-domain training and adapts the CLS token on the target domain for efficient few-shot learning. Extensive experiments on four benchmarks validate our rationale and demonstrate state-of-the-art performance. Our code will be released.
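
The phenomenon described in the abstract (keeping the CLS token at its random initialization while loading every other source-domain parameter) can be illustrated with a minimal sketch. This is not the authors' released code: the helper `load_except_cls`, the parameter names, and the toy values are all hypothetical, and plain Python lists stand in for tensors. The key name `cls_token` is an assumption that matches the naming in common ViT implementations.

```python
import random


def load_except_cls(target_params, source_params, cls_key="cls_token"):
    """Return a copy of target_params in which every parameter except
    cls_key is overwritten by its source-domain trained counterpart.
    (Hypothetical helper for illustration only.)"""
    loaded = {}
    for name, value in target_params.items():
        if name != cls_key and name in source_params:
            # Load the source-domain trained weights.
            loaded[name] = list(source_params[name])
        else:
            # Leave the CLS token (or any unmatched key) at its random init.
            loaded[name] = list(value)
    return loaded


rng = random.Random(0)
dim = 4  # toy embedding size, assumed for the example

# Toy "checkpoint" trained on the source domain (values are made up).
source = {
    "cls_token": [0.5] * dim,
    "patch_embed.weight": [1.0] * dim,
    "head.weight": [2.0] * dim,
}

# Freshly initialized model parameters with small random values.
fresh = {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in source}

model = load_except_cls(fresh, source)
```

After this call, `model["patch_embed.weight"]` and `model["head.weight"]` hold the source-domain values, while `model["cls_token"]` still holds its random initialization. The method proposed in the abstract goes further (decoupling domain information during source-domain training and adapting the CLS token on the target domain), which this sketch does not attempt to reproduce.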
