Autoregressive models and their sequential factorization of the data likelihood have recently demonstrated great potential for image representation and synthesis. Nevertheless, they incorporate image context in a linear 1D order by attending only to previously synthesized image patches above or to the left. This unidirectional, sequential bias of attention is not only unnatural for images, since it disregards large parts of a scene until synthesis is almost complete; it also processes the entire image on a single scale, thus ignoring more global contextual information up to the gist of the entire scene. As a remedy, we incorporate a coarse-to-fine hierarchy of context by combining the autoregressive formulation with a multinomial diffusion process: whereas a multistage diffusion process successively compresses and removes information to coarsen an image, we train a Markov chain to invert this process. In each stage, the resulting autoregressive ImageBART model progressively incorporates context from previous stages in a coarse-to-fine manner. Experiments demonstrate the gain over current autoregressive models, continuous diffusion probabilistic models, and latent variable models. Moreover, the approach makes it possible to control the synthesis process and to trade compression rate against reconstruction accuracy, while still guaranteeing visually plausible results.
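To make the forward coarsening direction concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a multinomial diffusion over a grid of discrete image tokens: each step independently resamples tokens with some probability, so the chain gradually destroys information. The vocabulary size, grid size, keep probability, and function names are illustrative assumptions. In the reverse, generative direction, one autoregressive model per step would predict the less corrupted grid while attending to the full, more heavily corrupted grid from the previous stage, which is where the coarse-to-fine context enters.

```python
# Minimal sketch (assumptions, not the authors' code) of a multinomial forward
# diffusion on a grid of discrete tokens: each step keeps a token with
# probability KEEP_PROB or resamples it uniformly from the codebook.
import numpy as np

VOCAB_SIZE = 1024   # size of the discrete codebook (assumption)
GRID = (16, 16)     # spatial size of the token grid (assumption)
KEEP_PROB = 0.7     # per-step probability of keeping a token (assumption)

rng = np.random.default_rng(0)

def forward_diffusion_step(tokens: np.ndarray) -> np.ndarray:
    """One multinomial corruption step: resample each token uniformly
    with probability 1 - KEEP_PROB."""
    resample = rng.random(tokens.shape) > KEEP_PROB
    noise = rng.integers(0, VOCAB_SIZE, size=tokens.shape)
    return np.where(resample, noise, tokens)

def forward_chain(x0: np.ndarray, num_steps: int) -> list:
    """Produce the chain x_0, x_1, ..., x_T of increasingly corrupted grids."""
    chain = [x0]
    for _ in range(num_steps):
        chain.append(forward_diffusion_step(chain[-1]))
    return chain

# Stand-in for an encoded image: a random grid of codebook indices.
x0 = rng.integers(0, VOCAB_SIZE, size=GRID)
chain = forward_chain(x0, num_steps=4)
# Fraction of tokens that differ from x_0 after each step (information loss).
print([round(float((c != x0).mean()), 2) for c in chain])
```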
Author Information
Patrick Esser (Runway ML / Heidelberg University)
Robin Rombach (Heidelberg University, LMU Munich)
Andreas Blattmann (Heidelberg University, LMU Munich)
Bjorn Ommer (Heidelberg University)

Björn Ommer is a full professor at the University of Munich, where he heads the Computer Vision & Learning Group. Before that, he was a full professor in the Department of Mathematics and Computer Science at Heidelberg University and a co-director of its Interdisciplinary Center for Scientific Computing. He received his diploma in computer science from the University of Bonn and his PhD from ETH Zurich, and he was a postdoc at UC Berkeley. Björn serves as an associate editor for IEEE T-PAMI. His research interests include semantic scene understanding and retrieval, generative AI and visual synthesis, self-supervised metric and representation learning, and explainable AI. Moreover, he applies this basic research in interdisciplinary projects within neuroscience and the digital humanities. His group has published a series of generative approaches, including "VQGAN" and "Stable Diffusion", which are now democratizing the creation of visual content and have already opened up an abundance of new directions in research, industry, the media, and beyond.
More from the Same Authors
- 2020: A Note on Data Biases in Generative Models
  Patrick Esser
- 2022 Poster: Retrieval-Augmented Diffusion Models
  Andreas Blattmann · Robin Rombach · Kaan Oktay · Jonas Müller · Björn Ommer
- 2021 Poster: Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning
  Timo Milbich · Karsten Roth · Samarth Sinha · Ludwig Schmidt · Marzyeh Ghassemi · Bjorn Ommer
- 2020: An Image is Worth 16 × 16 Tokens: Visual Priors for Efficient Image Synthesis with Transformers
  Robin Rombach
- 2020 Poster: Network-to-Network Translation with Conditional Invertible Neural Networks
  Robin Rombach · Patrick Esser · Bjorn Ommer
- 2020 Oral: Network-to-Network Translation with Conditional Invertible Neural Networks
  Robin Rombach · Patrick Esser · Bjorn Ommer
- 2016 Poster: CliqueCNN: Deep Unsupervised Exemplar Learning
  Miguel A Bautista · Artsiom Sanakoyeu · Ekaterina Tikhoncheva · Bjorn Ommer
- 2012 Poster: Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity
  Angela Eigenstetter · Bjorn Ommer