Eric Xing: PAN: A Stateful, Interactable, and Long-horizon World Model
Abstract
A world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has become an emerging topic in recent years because of the rising need to develop virtual or robotic agents with artificial (general) intelligence. In this talk, starting from the imagination in the sci-fi classic Dune and drawing inspiration from the concept of "hypothetical thinking" in the psychology literature, I discuss several schools of thought on world modeling and lay the groundwork for a Physical, Agentic, and Nested (PAN) world model. Its primary goal is to simulate all actionable possibilities of the real world for purposeful reasoning and planning via thought experiments, so that an agent can perform long-term, causal, and coherent actions toward a goal (or goals), rather than to optimize short-horizon visual metrics such as frame-level fidelity or motion realism. We propose a Generative Latent Prediction (GLP) architecture that builds on a stateful latent space, long-term and closed-loop action-conditioned latent reasoning, inference grounding over realizable world states, and training through both self-supervised learning (SSL) and reinforcement learning (RL). We then present PAN, a model built on the GLP architecture that brings together perception, state, action, and causality to support open-domain interactable world simulation. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models.