Poster

On the Comparison between Multi-modal and Single-modal Contrastive Learning

Wei Huang · Andi Han · Yongqiang Chen · Yuan Cao · Zhiqiang Xu · Taiji Suzuki


Abstract:

Multimodal contrastive learning with language supervision has driven a paradigm shift in modern machine learning. By pre-training on web-scale datasets, it learns high-quality representations with impressive robustness and transferability. Despite this empirical success, the theoretical understanding remains in its infancy, especially regarding how multimodal contrastive learning compares with its single-modal counterpart. In this work, we introduce a feature learning theory framework for understanding the differences between multimodal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, our analysis studies a ReLU network trained with the InfoMax objective. Through a trajectory-based optimization analysis and a characterization of downstream generalization, we identify the signal-to-noise ratio (SNR) as the critical factor governing the downstream generalizability of both multimodal and single-modal contrastive learning. Through cooperation between the two modalities, multimodal learning achieves better feature learning, and hence better downstream performance, than single-modal learning. Our analysis provides a unified framework that characterizes the optimization and generalization of both single-modal and multimodal contrastive learning. Empirical simulations further confirm our theoretical conclusions.
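
The sketch below is a minimal, self-contained illustration (in PyTorch) of the kind of setup the abstract describes, assuming a toy signal-plus-noise data model, a two-layer ReLU encoder, and an InfoNCE-style contrastive objective as a stand-in for the InfoMax loss. The `snr` parameter, network widths, two-modality construction, and the linear-probe downstream check are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch: signal-plus-noise data, ReLU encoders, InfoNCE-style contrastive
# training across two modalities. All hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n, snr = 32, 256, 2.0   # feature dimension, sample count, signal-to-noise ratio

# Signal-plus-noise data model: each sample is a shared signal direction scaled by
# the label and the SNR, plus independent Gaussian noise in each "modality".
signal = F.normalize(torch.randn(dim), dim=0)
labels = torch.randint(0, 2, (n,)) * 2 - 1                     # +/-1 class labels
x_a = snr * labels[:, None] * signal + torch.randn(n, dim)     # modality A
x_b = snr * labels[:, None] * signal + torch.randn(n, dim)     # modality B

class ReLUEncoder(torch.nn.Module):
    """Two-layer ReLU network producing L2-normalized embeddings."""
    def __init__(self, dim, width=64, out=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, width), torch.nn.ReLU(), torch.nn.Linear(width, out)
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def infonce(z1, z2, temperature=0.5):
    """InfoNCE loss: matched pairs are positives, all other pairs are negatives."""
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

# Multi-modal training: one encoder per modality, contrast across modalities.
enc_a, enc_b = ReLUEncoder(dim), ReLUEncoder(dim)
opt = torch.optim.Adam(list(enc_a.parameters()) + list(enc_b.parameters()), lr=1e-3)
for step in range(200):
    opt.zero_grad()
    loss = infonce(enc_a(x_a), enc_b(x_b))
    loss.backward()
    opt.step()

# Downstream check: linear separability of the learned representation by label.
with torch.no_grad():
    z = enc_a(x_a)
    direction = z[labels == 1].mean(0) - z[labels == -1].mean(0)
    acc = ((z @ direction > 0).float() == (labels == 1).float()).float().mean()
print(f"final InfoNCE loss {loss.item():.3f}, downstream accuracy {acc.item():.3f}")
```

In this sketch, the single-modal counterpart would reuse a single encoder and contrast two independently noised views of the same modality; comparing downstream accuracy across values of `snr` is one way to probe, empirically, the SNR dependence highlighted in the abstract.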
