

Poster

Generating compositional scenes via Text-to-image RGBA Instance Generation

Alessandro Fontanella · Petru-Daniel Tudosiu · Yongxin Yang · Shifeng Zhang · Sarah Parisot

East Exhibit Hall A-C #2409
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Text-to-image diffusion generative models can generate high-quality images, but at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. In this work, we propose to address layout-driven controllable image generation from a multi-layer perspective. We devise a novel training paradigm to adapt a diffusion model to generate isolated objects as RGBA images with transparency information. To build complex scenes, we then generate each scene component individually and introduce a multi-layer noise blending strategy to combine them into a realistic composite scene. Our experiments show that our RGBA diffusion model is capable of generating diverse, high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach allows building and manipulating complex scenes with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
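
To make the multi-layer idea concrete, the following is a minimal sketch of how independently generated RGBA instances could be stacked into one scene. It uses the standard alpha "over" operator in pixel space, which is not the noise blending strategy proposed in the paper (the abstract does not specify that procedure). The function names `alpha_over` and `compose_scene` are hypothetical, and the sketch assumes float images in [0, 1] with straight (non-premultiplied) alpha and layers that fit entirely inside the canvas.

```python
import numpy as np

def alpha_over(background, layer, top, left):
    """Composite an RGBA layer onto an RGBA canvas at (top, left).

    Standard 'over' operator with straight alpha; this is an illustrative
    stand-in, not the paper's multi-layer noise blending.
    """
    out = background.astype(np.float64).copy()
    h, w = layer.shape[:2]
    region = out[top:top + h, left:left + w]  # view into `out`
    src_rgb, src_a = layer[..., :3], layer[..., 3:4]
    dst_rgb, dst_a = region[..., :3].copy(), region[..., 3:4].copy()
    out_a = src_a + dst_a * (1.0 - src_a)
    safe_a = np.where(out_a > 0, out_a, 1.0)  # avoid division by zero
    region[..., :3] = (src_rgb * src_a + dst_rgb * dst_a * (1.0 - src_a)) / safe_a
    region[..., 3:4] = out_a
    return out

def compose_scene(height, width, instances):
    """Stack (layer, top, left) tuples back to front on a transparent canvas."""
    canvas = np.zeros((height, width, 4))
    for layer, top, left in instances:
        canvas = alpha_over(canvas, layer, top, left)
    return canvas

# Two toy "instances": an opaque red square and a half-transparent blue one.
red = np.zeros((2, 2, 4)); red[..., 0] = 1.0; red[..., 3] = 1.0
blue = np.zeros((2, 2, 4)); blue[..., 2] = 1.0; blue[..., 3] = 0.5
scene = compose_scene(4, 4, [(red, 0, 0), (blue, 1, 1)])
```

Because each instance keeps its own alpha channel, editing the layout in this sketch amounts to changing a layer's `(top, left)` offset or its position in the list and recomposing, without regenerating the other instances.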
