MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Abstract
In recent years, large-scale generative models for visual content (spanning images, 3D scenes, and videos) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive owing to their inherent multimodality, the long sequences of visual tokens involved, and complex spatiotemporal dependencies. To address these challenges, we systematically design a training framework that optimizes four key components: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure. These optimizations deliver substantial efficiency gains and performance improvements across every stage of the pipeline, including data processing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. The resulting model, MUG-V 10B, achieves performance on par with recent state-of-the-art video generation models. Notably, it excels at e-commerce–oriented video generation, outperforming leading open-source models in human evaluations.