Petri Net Structure-Driven Video Generation
Abstract
Recent advances in video generation have unlocked new opportunities for simulating real-world activities. Yet, existing models often struggle to faithfully represent structured, multi-step processes—such as those found in business workflows—resulting in temporally inconsistent or semantically incoherent outputs. To address this gap, we propose Petri Net Structure-Driven Video Generation, an approach that leverages formal process models to guide the generation of coherent and semantically grounded video simulations. Specifically, we incorporate process-aware structural information through: (i) domain-specific prompting enriched with process semantics, (ii) storyboard construction using reference frames extracted from real-world process evidence, and (iii) synthetic reference frames informed by Petri Net structures. We evaluate our method across multiple domains and show that grounding generation in process model structure improves temporal coherence, semantic fidelity, and user-perceived realism. Our approach demonstrates how structured symbolic representations can enhance generative video systems, opening new directions for process-aware visual synthesis.
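To make the notion of "Petri Net structure" concrete, the following is a minimal sketch of the token-game semantics of a Petri net, the kind of formal process model the abstract refers to. All class and transition names here are hypothetical illustrations, not part of the paper's implementation: places hold token counts, and a transition fires only when every input place holds a token, which is what enforces the step ordering of a multi-step workflow.

```python
class PetriNet:
    """Minimal Petri net: places hold token counts; transitions move tokens."""

    def __init__(self):
        self.marking = {}      # place name -> current token count
        self.transitions = {}  # transition name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)
        for place in inputs + outputs:
            self.marking.setdefault(place, 0)

    def enabled(self, name):
        # A transition is enabled when every input place has at least one token.
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        # Firing consumes one token per input place, produces one per output place.
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1

# Hypothetical two-step business workflow: orders must be received before shipping.
net = PetriNet()
net.add_transition("receive", ["start"], ["received"])
net.add_transition("ship", ["received"], ["shipped"])
net.marking["start"] = 1

net.fire("receive")
net.fire("ship")
print(net.marking)  # {'start': 0, 'received': 0, 'shipped': 1}
```

Under this reading, each firing sequence of the net corresponds to one valid ordering of process steps, which is the structural signal the proposed approach uses to keep generated video segments temporally and semantically consistent.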