We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, they completely disregard the spatial cues of audio and visual signals that naturally occur in the real world. To learn from these spatial cues, we task a network with contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video, using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition, and video semantic segmentation. Dataset and code are available at https://github.com/pedro-morgado/AVSpatialAlignment.
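To make the pretext task concrete, the sketch below shows one plausible form of a contrastive spatial-alignment objective: within a single 360° video, each visual viewpoint crop should match the audio rendered for that same viewpoint and not the audio from other viewpoints. This is a minimal illustration, not the authors' implementation; the function name, feature shapes, and temperature are assumptions, and the real model additionally combines viewpoint representations with a transformer before contrasting.

```python
# Minimal sketch (assumed, not the released code) of a cross-modal InfoNCE
# loss over viewpoints of one 360° video. Visual and audio embeddings are
# hypothetical (V, D) tensors, one row per viewpoint crop.
import torch
import torch.nn.functional as F

def spatial_alignment_loss(vis_feats, aud_feats, temperature=0.07):
    """Contrast viewpoints within a video: matching (visual, audio)
    viewpoints are positives, all other viewpoints are negatives."""
    vis = F.normalize(vis_feats, dim=-1)
    aud = F.normalize(aud_feats, dim=-1)
    logits = vis @ aud.t() / temperature              # (V, V) similarity matrix
    targets = torch.arange(vis.size(0), device=logits.device)
    # Symmetric loss: visual->audio and audio->visual viewpoint matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features for 4 viewpoints of one video.
V, D = 4, 128
loss = spatial_alignment_loss(torch.randn(V, D), torch.randn(V, D))
print(loss.item())
```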
Author Information
Pedro Morgado (University of California, San Diego)
Yi Li (UC San Diego)
Nuno Vasconcelos (UC San Diego)
More from the Same Authors
- 2020 Poster: Contrastive Learning with Adversarial Examples
  Chih-Hui Ho · Nuno Vasconcelos
- 2019 Poster: Deliberative Explanations: visualizing network insecurities
  Pei Wang · Nuno Vasconcelos
- 2018 Poster: Self-Supervised Generation of Spatial Audio for 360° Video
  Pedro Morgado · Nuno Vasconcelos · Timothy Langlois · Oliver Wang
- 2016 Poster: Large Margin Discriminant Dimensionality Reduction in Prediction Space
  Ehsan Saberian · Jose Costa Pereira · Nuno Vasconcelos · Can Xu