MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Abstract
In recent years, large-scale generative models for visual content (spanning images, 3D scenes, and videos) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive owing to their inherent multimodality, the long sequences of visual tokens involved, and complex spatiotemporal dependencies. To address these challenges, we systematically design a training framework that optimizes four key components: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure. These optimizations deliver substantial efficiency gains and performance improvements across every stage of the pipeline, including data processing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. The resulting model, MUG-V 10B, achieves performance on par with recent state-of-the-art video generation models. Notably, it excels at e-commerce–oriented video generation, outperforming leading open-source models in human evaluations.