Timezone: »

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers
Ming Ding · Wendi Zheng · Wenyi Hong · Jie Tang

Thu Dec 01 09:00 AM -- 11:00 AM (PST) @ Hall J #925

Development of transformer-based text-to-image models is impeded by its slow generation and complexity, for high-resolution images. In this work, we put forward a solution based on hierarchical transformers and local parallel autoregressive generation. We pretrain a 6B-parameter transformer with a simple and flexible self-supervised task, a cross-modal general language model (CogLM), and fine-tune it for fast super-resolution. The new text-to-image system, CogView2, shows very competitive generation compared to concurrent state-of-the-art DALL-E-2, and naturally supports interactive text-guided editing on images.

Author Information

Ming Ding (Tsinghua University)
Wendi Zheng (Tsinghua University)
Wenyi Hong (Department of Computer Science and Technology, Tsinghua University)
Jie Tang (Tsinghua University)

More from the Same Authors