

Poster

Classification Done Right for Vision-Language Pre-Training

Zilong Huang · Haoqi Fan · Qinghao Ye · Bingyi Kang · Jiashi Feng

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

We present SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, our method does not require a text encoder. Instead, it directly uses tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Due to the absence of the text encoder, SuperClass is more efficient than CLIP, and it achieves superior performance on various downstream tasks, including classic computer vision benchmarks and vision & language tasks. We further explore its scaling behavior relative to CLIP with respect to model size and training length, and report encouraging results and comparisons. We hope our work can inspire future vision-language research.
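To make the core idea concrete, below is a minimal PyTorch sketch of classification-style vision-language pre-training as described in the abstract: the tokenized caption acts as a bag-of-words classification target over the tokenizer vocabulary, so no text encoder is needed. The module names, the binary cross-entropy loss, and the absence of label weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagOfWordsClassifier(nn.Module):
    """Hypothetical sketch: map image features to logits over the text
    tokenizer's vocabulary and treat the tokenized raw caption as a
    multi-label classification target. SuperClass's exact head and
    loss may differ."""

    def __init__(self, image_encoder: nn.Module, feat_dim: int, vocab_size: int):
        super().__init__()
        self.image_encoder = image_encoder           # e.g., a ViT backbone
        self.head = nn.Linear(feat_dim, vocab_size)  # one logit per subword token

    def forward(self, images: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W); token_ids: (B, L) tokenized raw captions
        feats = self.image_encoder(images)   # (B, feat_dim)
        logits = self.head(feats)            # (B, vocab_size)
        # Bag-of-words target: 1 for every subword that appears in the
        # caption, 0 elsewhere (no text filtering or selection).
        # Padding tokens would normally be masked out; omitted for brevity.
        targets = torch.zeros_like(logits)
        targets.scatter_(1, token_ids, 1.0)
        return F.binary_cross_entropy_with_logits(logits, targets)
```

Because the target is built directly from the tokenizer output, the text branch reduces to a single linear layer, which is where the efficiency gain over a full contrastive text encoder comes from.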
