Workshop: OPT 2023: Optimization for Machine Learning

Multi-head CLIP: Improving CLIP with Diverse Representations and Flat Minima

Mo Zhou · Xiong Zhou · Erran Li Li · Stefano Ermon · Rong Ge


Contrastive Language-Image Pre-training (CLIP) has shown remarkable success in multimodal learning by enabling joint understanding of text and images. In this paper, we introduce a novel method called Multi-head CLIP, inspired by Stein Variational Gradient Descent (SVGD) and Sharpness-aware Minimization (SAM). Our approach aims to enhance CLIP's learning capability by encouraging the model to acquire diverse features while also promoting convergence toward a flat loss region, resulting in improved generalization performance. We conduct extensive experiments on two benchmark datasets, YFCC15M and CC3M, to evaluate the effectiveness of our proposed method. The experimental results consistently demonstrate that Multi-head CLIP outperforms both the original CLIP architecture and CLIP trained with the SAM optimizer.
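The abstract does not spell out how the two ingredients combine, but the named techniques suggest a natural update rule: each "head" is treated as an SVGD particle whose gradients are evaluated at a SAM-style adversarially perturbed point, while an RBF-kernel repulsion term pushes the heads apart to keep their representations diverse. The sketch below is an illustrative NumPy toy on a quadratic loss, not the paper's actual implementation; the loss, step sizes, and kernel bandwidth are all assumptions.

```python
import numpy as np

TARGET = np.array([1.0, -2.0])  # hypothetical optimum of a stand-in quadratic loss

def loss_grad(w):
    # Toy loss L(w) = 0.5 * ||w - TARGET||^2, standing in for the CLIP objective.
    return w - TARGET

def multihead_sam_svgd_step(W, lr=0.1, rho=0.05, h=1.0):
    """One update of n heads W (shape (n, d)) combining SAM and SVGD-style repulsion."""
    n = len(W)
    # SAM: ascend to a perturbed point within an L2 ball of radius rho,
    # then take gradients there to bias toward flat minima.
    grads = np.stack([loss_grad(w) for w in W])
    norms = np.linalg.norm(grads, axis=1, keepdims=True) + 1e-12
    grads_adv = np.stack([loss_grad(w) for w in W + rho * grads / norms])
    # SVGD: RBF kernel couples the heads; the repulsion term keeps them diverse.
    diffs = W[:, None, :] - W[None, :, :]                 # (n, n, d), w_i - w_j
    K = np.exp(-(diffs ** 2).sum(-1) / (2 * h ** 2))      # (n, n) kernel matrix
    repulse = (diffs * K[:, :, None]).sum(axis=1) / h ** 2  # pushes head i away from neighbors
    phi = (K @ (-grads_adv) + repulse) / n                # SVGD direction per head
    return W + lr * phi

heads = np.random.RandomState(0).randn(4, 2)  # 4 heads in a 2-D toy parameter space
for _ in range(200):
    heads = multihead_sam_svgd_step(heads)
```

After the loop, the heads cluster around the loss minimum while remaining mutually distinct, which is the qualitative behavior the abstract attributes to the method: convergence to a common flat region with diverse per-head solutions.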