Skip to yearly menu bar Skip to main content


Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang · Yufan Feng · Cheng Shi · Lan Xu · Jingyi Yu · Sibei Yang

Great Hall & Hall B1+B2 (level 1) #610
[ ] [ Project Page ]
[ Paper [ Poster [ OpenReview
Thu 14 Dec 8:45 a.m. PST — 10:45 a.m. PST


Text-to-video is a rapidly growing research area that aims to generate a semantic, identical, and temporal coherence sequence of frames that accurately align with the input text prompt. This study focuses on zero-shot text-to-video generation considering the data- and cost-efficient. To generate a semantic-coherent video, exhibiting a rich portrayal of temporal semantics such as the whole process of flower blooming rather than a set of ``moving images'', we propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence, while pre-trained latent diffusion models (LDMs) as the animator to generate the high fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications to adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without any video data and training requirements, Free-Bloom generates vivid and high-quality videos, awe-inspiring in generating complex scenes with semantic meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDMs-based extensions.

Chat is not available.