EMM-1: Expanding embedding space to be multimodal and multilingual
Abstract
Many "multimodal" systems are in practice only bi-modal, or at best tri-modal. The shift to multimodal AI has created an urgent need for large-scale, high-quality training data spanning more modalities. EMM-1 is the largest and highest-quality dataset spanning five modalities. The dataset has three parts:
i) a large (>100M samples), automatically generated dataset of matching (caption, image, video, audio, point cloud) quintuples;
ii) a human-rated subset comprising ~1M ratings of cross-modal pairs among the five modalities;
iii) a first-of-its-kind, consensus-based evaluation set (3.5K data points) for assessing zero-shot capabilities between audio and point clouds.
With this release, we hope to accelerate the development of truly multimodal applications. To demonstrate the usefulness of the dataset, we publish a simple yet powerful baseline model that achieves strong cross-modal retrieval performance. The model nonetheless leaves substantial headroom for further optimization, such as attention over full token sequences, quality-weighted objectives, and expanded fine-tuning. By expanding captions to multiple languages, we make this dataset accessible to teams building multimodal AI worldwide.
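To make the three-part structure concrete, the following minimal Python sketch shows one possible way to represent a quintuple sample and a human pair rating. All field names, file paths, and the rating scale are hypothetical illustrations, not the actual EMM-1 schema.

```python
# Illustrative sketch only: field names, paths, and the rating scale are
# hypothetical and are not specified in the abstract.
from dataclasses import dataclass


@dataclass
class QuintupleSample:
    """One automatically generated sample linking the five modalities."""
    sample_id: str
    caption: str           # text caption (multilingual in the expanded release)
    image_path: str        # URI of the matching image
    video_path: str        # URI of the matching video clip
    audio_path: str        # URI of the matching audio track
    pointcloud_path: str   # URI of the matching point cloud


@dataclass
class PairRating:
    """One human judgment on how well two modalities of a sample match."""
    sample_id: str
    modality_a: str        # e.g. "audio"
    modality_b: str        # e.g. "pointcloud"
    score: float           # human-assigned match quality (scale is assumed)


# Example usage with made-up values:
sample = QuintupleSample(
    sample_id="000001",
    caption="A dog barks while running across a grassy field",
    image_path="images/000001.jpg",
    video_path="videos/000001.mp4",
    audio_path="audio/000001.wav",
    pointcloud_path="pointclouds/000001.ply",
)
rating = PairRating(sample_id="000001", modality_a="audio",
                    modality_b="pointcloud", score=4.0)
```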