Poster
Extending Multi-modal Contrastive Representations
Ziang Zhang · Zehan Wang · Luping Liu · Rongjie Huang · Xize Cheng · Zhenhui Ye · Wang Lin · Huadai Liu · Haifeng Huang · Yang Zhao · Tao Jin · Siqi Zheng · Zhou Zhao
East Exhibit Hall A-C #3403
Fri 13 Dec, 4:30 p.m. – 7:30 p.m. PST
Abstract:
Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, their heavy dependence on large-scale, high-quality paired data and expensive training costs limit further development. Inspired by the recent C-MCR, this paper proposes $\textbf{Ex}$tending $\textbf{M}$ulti-modal $\textbf{C}$ontrastive $\textbf{R}$epresentation (Ex-MCR), a training-efficient and paired-data-free method for building a unified contrastive representation across many modalities. Because C-MCR learns a completely new latent space for two non-overlapping modalities and projects both onto it, a significant amount of information from their original spaces is lost during projection. To address this issue, Ex-MCR extends one modality's space into the other's rather than mapping both modalities onto an entirely new space, which effectively preserves the semantic alignment of the original space. Experimentally, we extend pre-trained audio-text and 3D-image representations into the existing vision-text space. Without using any paired data, Ex-MCR achieves performance comparable to advanced methods on a series of audio-image-text and 3D-image-text tasks, and achieves superior performance when used in parallel with data-driven methods. Moreover, semantic alignment also emerges between the extended modalities (e.g., audio and 3D).
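To make the extension idea concrete, below is a minimal, hypothetical sketch, not the authors' implementation. It assumes a CLAP-like audio-text space (dimension 512) and a fixed CLIP-like vision-text space (dimension 768), substitutes random tensors for real encoder outputs, and trains a small projector so that the source space's text embeddings align with the target text embeddings on an unpaired text corpus; audio then inherits the alignment by passing through the same projector. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Conceptual sketch only: extending one contrastive space into another through
# the text modality they share. Random tensors stand in for real encoder outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

SRC_DIM, TGT_DIM, N_TEXTS = 512, 768, 4096  # assumed embedding sizes / corpus size

# Stand-ins for encoder outputs on the same (unpaired) text corpus.
src_text = F.normalize(torch.randn(N_TEXTS, SRC_DIM), dim=-1)  # audio-text space
tgt_text = F.normalize(torch.randn(N_TEXTS, TGT_DIM), dim=-1)  # vision-text space

# Only the source space is projected; the target vision-text space stays fixed,
# so its original image-text alignment is preserved.
projector = nn.Sequential(
    nn.Linear(SRC_DIM, 1024), nn.GELU(), nn.Linear(1024, TGT_DIM)
)
opt = torch.optim.Adam(projector.parameters(), lr=1e-3)

def info_nce(a, b, tau=0.05):
    """Symmetric InfoNCE between row-aligned embedding batches."""
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

for step in range(200):
    idx = torch.randint(0, N_TEXTS, (256,))
    proj = F.normalize(projector(src_text[idx]), dim=-1)
    loss = info_nce(proj, tgt_text[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference, audio embeddings from the source space pass through the same
# projector, landing them in the vision-text space alongside images and text.
audio_emb = F.normalize(torch.randn(8, SRC_DIM), dim=-1)
audio_in_vision_text_space = F.normalize(projector(audio_emb), dim=-1)
print(audio_in_vision_text_space.shape)  # torch.Size([8, 768])
```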