
Workshop: Robustness of zero/few-shot learning in foundation models (R0-FoMo)

LOVM: Language-Only Vision Model Selection

Orr Zohar · Shih-Cheng Huang · Kuan-Chieh Wang · Serena Yeung


Pre-trained multi-modal vision-language models (VLMs) excel in downstream applications, especially in the few- and zero-shot settings. However, choosing the optimal VLM for a given downstream application is challenging because performance depends on both the task and the dataset. Exhaustively evaluating all VLMs is impractical and requires collecting a labeled evaluation dataset. As the number of open-source VLM variants increases, there is a need for an efficient model selection strategy that does not require access to a curated evaluation dataset. To address this, we introduce a novel task, LOVM: Language-Only Vision Model Selection, where methods are expected to perform both model selection and performance prediction based solely on a text description of the desired downstream application. We also present an extensive LOVM benchmark consisting of ground-truth evaluations of 23 pre-trained VLMs on 35 datasets, enabling effective ranking and performance prediction of VLMs. Code and dataset will be publicly available upon publication.
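To make the task setup concrete, the sketch below illustrates the LOVM interface described in the abstract: a candidate VLM is scored for a downstream task using only a text description (task prompt and class names), with no images or labels. The data structures, the per-model prior scores, and the difficulty heuristic are all hypothetical illustrations, not the paper's actual method.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TaskDescription:
    """Text-only specification of a downstream task (no images, no labels)."""
    task: str                 # e.g. "classify photos of pets"
    class_names: List[str]    # the label set, as plain text

def predict_accuracy(model_name: str, desc: TaskDescription,
                     prior_scores: Dict[str, float]) -> float:
    """Toy performance predictor (hypothetical): combine a model's known
    prior score (e.g. its published zero-shot accuracy on a reference
    benchmark) with a crude task-difficulty proxy derived purely from the
    text description -- here, just the number of classes."""
    difficulty = 1.0 / (1.0 + 0.01 * len(desc.class_names))
    return prior_scores[model_name] * difficulty

def rank_vlms(models: List[str], desc: TaskDescription,
              prior_scores: Dict[str, float]) -> List[str]:
    """LOVM-style model selection: rank candidate VLMs for a task
    using text alone, best predicted performer first."""
    scores = {m: predict_accuracy(m, desc, prior_scores) for m in models}
    return sorted(scores, key=scores.get, reverse=True)

# Example usage (model names and prior scores are made up for illustration):
desc = TaskDescription(task="classify photos of pets",
                       class_names=["cat", "dog"])
priors = {"vlm-small": 0.60, "vlm-large": 0.68, "vlm-medium": 0.65}
ranking = rank_vlms(list(priors), desc, priors)
```

A real LOVM method would replace the toy `predict_accuracy` with a model that exploits richer text-derived signals, but the input/output contract (text description in, ranking and predicted scores out) matches the task definition above.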
