

Poster

Tell What You Hear From What You See - Video to Audio Generation Through Text

Xiulong Liu · Kun Su · Eli Shlizerman

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Scenes with visual and audio content are multi-faceted: a video stream can be paired with various audio streams and vice versa. Therefore, in the video-to-audio generation task, the ability to steer the generated audio is imperative. While video-to-audio generation is a well-established generative task, existing methods lack such controllability. In this work, we propose VATT, a multi-modal generative framework that takes a video and an optional text prompt as input, and generates audio and an optional textual description (caption) of the audio. Such a framework has two unique advantages: i) the video-to-audio generation process can be refined and controlled via text that complements the video context, and ii) the model can suggest what audio to generate for the video by generating audio captions. VATT consists of two key modules: VATT Converter, an instruction fine-tuned LLM with a projection layer that maps video features to the LLM vector space, and VATT Audio, a bi-directional transformer that generates audio tokens from visual frames and an optional text prompt using iterative parallel decoding. The audio tokens and the text prompt are used by a pretrained neural codec to produce a waveform. Our experiments compare VATT with existing video-to-audio generation methods on objective metrics using the VGGSound audio-visual dataset. They show that VATT achieves competitive performance when no audio caption is provided, and more refined performance when a caption is given as a prompt (with the lowest KLD score of 1.41). Furthermore, VATT Audio is consistently chosen as the preferred audio over other methods in a subjective study. VATT enables controllable video-to-audio generation through text as well as suggesting text prompts for videos through audio captions, unlocking novel applications such as text-guided video-to-audio generation and video-to-audio captioning.
To check generated samples, please visit https://anonymous.4open.science/w/VATT-6CB2/ (the Chrome browser is recommended).
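
The abstract mentions that VATT Audio generates audio tokens with iterative parallel decoding but does not spell out the procedure. The following is a minimal sketch of the general idea, not the authors' implementation: all positions start masked, every position is predicted in parallel at each step, and only the most confident predictions are committed. The cosine unmasking schedule, the `MASK` id, and the `toy_predictor` stand-in (which returns random tokens and confidences instead of running a transformer conditioned on video and text) are assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical id for a position that has not been decoded yet


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a bi-directional transformer (assumption, not VATT Audio).

    Returns one (token, confidence) pair per position. A real model would
    condition on the already-committed tokens plus video and text features.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_parallel_decode(seq_len, vocab_size, steps,
                              predictor=toy_predictor, seed=0):
    """Fill a fully masked sequence over a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict every position at once.
        preds = predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this step.
        ratio = math.cos(math.pi / 2 * step / steps)
        keep_masked = max(0, math.floor(seq_len * ratio))
        # Always commit at least one position so decoding makes progress.
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    audio_tokens = iterative_parallel_decode(seq_len=16, vocab_size=1024, steps=4)
    print(audio_tokens)
```

In this sketch the final step always commits every remaining position, so the number of forward passes is bounded by `steps` rather than by the sequence length, which is the main appeal of parallel decoding over token-by-token generation.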
