ZeroTrail: Zero-Shot Trajectory Control Framework for Video Diffusion Models
Abstract
Recent large-scale text-to-video diffusion models have demonstrated a striking capability to synthesize realistic clips, yet effective control over objects' motion trajectories remains challenging. Prior attempts either require model-specific architecture modifications and costly training, rely on zero-shot attention masks with limited effectiveness, or stack multiple rounds of test-time latent optimization, achieving only modest controllability at high computational cost and long running times. In this work, we introduce ZeroTrail, a zero-shot, tuning-free framework that equips video diffusion models with superior trajectory controllability without altering the model architecture or incurring the inference overhead of multiple optimization rounds. The framework comprises two key components: (i) a Trajectory Prior Injection Module (TPIM), which embeds the desired path into latent features through a single round of test-time training, and (ii) a Selective Attention Guidance Module (SAGM), which dynamically amplifies or attenuates cross-frame attention to reinforce the injected prior while preserving spatiotemporal coherence. Because the framework is modular, it can be plugged into existing video diffusion models without additional training. Extensive experiments demonstrate that ZeroTrail consistently outperforms existing methods, accurately steering objects along complex trajectories while maintaining the base model's ability to generate high-quality, temporally consistent videos.
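To make the two components concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation) of the abstract's two ideas: injecting a trajectory prior into latent features via a single round of test-time optimization, and selectively rescaling cross-frame attention toward the desired path. All function names (trajectory_mask, inject_trajectory_prior, selective_attention_guidance), the toy energy loss, and the tensor shapes are assumptions expressed in plain PyTorch.

```python
# Hypothetical illustration only: module names, loss terms, and shapes are assumptions.
import torch

def trajectory_mask(traj, frames, height, width, radius=3):
    """Rasterize a per-frame (x, y) trajectory into binary spatial masks."""
    masks = torch.zeros(frames, height, width)
    ys = torch.arange(height).view(-1, 1)
    xs = torch.arange(width).view(1, -1)
    for t, (x, y) in enumerate(traj):
        masks[t] = ((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2).float()
    return masks

def inject_trajectory_prior(latents, masks, steps=10, lr=0.1):
    """TPIM-style idea: one short round of test-time optimization that pushes
    latent feature energy toward the desired path (toy energy loss)."""
    latents = latents.clone().requires_grad_(True)
    opt = torch.optim.Adam([latents], lr=lr)
    for _ in range(steps):
        energy = latents.pow(2).mean(dim=1)                 # (frames, H, W)
        loss = -(energy * masks).mean() + (energy * (1 - masks)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latents.detach()

def selective_attention_guidance(attn, key_mask, boost=1.5, damp=0.7):
    """SAGM-style idea: amplify attention weights on keys inside the trajectory
    region, attenuate the rest, then renormalize each query's distribution."""
    scale = damp + (boost - damp) * key_mask                # (num_keys,)
    guided = attn * scale                                   # broadcast over queries
    return guided / guided.sum(dim=-1, keepdim=True)

if __name__ == "__main__":
    frames, channels, H, W = 8, 4, 32, 32
    traj = [(4 + 3 * t, 16) for t in range(frames)]         # object moves left to right
    masks = trajectory_mask(traj, frames, H, W)
    latents = inject_trajectory_prior(torch.randn(frames, channels, H, W), masks)
    # Toy cross-frame attention: H*W queries of one frame attend to all frames' tokens.
    attn = torch.softmax(torch.randn(H * W, frames * H * W), dim=-1)
    attn = selective_attention_guidance(attn, masks.flatten())
    print(latents.shape, attn.shape, float(attn.sum(-1).mean()))  # rows still sum to 1
```

In this toy setup the prior injection only reshapes latent energy along the path and the guidance only rescales attention weights; in the actual framework both operations would act on the diffusion model's internal features and attention maps during sampling.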