Skip to yearly menu bar Skip to main content


Focus Your Attention when Few-Shot Classification

Haoqing Wang · Shibo Jie · Zhihong Deng

Great Hall & Hall B1+B2 (level 1) #1121
[ ]
[ Paper [ Slides [ Poster [ OpenReview
Thu 14 Dec 3 p.m. PST — 5 p.m. PST


Since many pre-trained vision transformers emerge and provide strong representation for various downstream tasks, we aim to adapt them to few-shot image classification tasks in this work. The input images typically contain multiple entities. The model may not focus on the class-related entities for the current few-shot task, even with fine-tuning on support samples, and the noise information from the class-independent ones harms performance. To this end, we first propose a method that uses the attention and gradient information to automatically locate the positions of key entities, denoted as position prompts, in the support images. Then we employ the cross-entropy loss between their many-hot presentation and the attention logits to optimize the model to focus its attention on the key entities during fine-tuning. This ability then can generalize to the query samples. Our method is applicable to different vision transformers (e.g., columnar or pyramidal ones), and also to different pre-training ways (e.g., single-modal or vision-language pre-training). Extensive experiments show that our method can improve the performance of full or parameter-efficient fine-tuning methods on few-shot tasks. Code is available at

Chat is not available.