Oral
in
Workshop: Foundation Models for Decision Making

A Universal World Model Learned from Large Scale and Diverse Videos

Hanchen Cui ⋅ Yang Gao

Project Page [ OpenReview]

Abstract

World models play a crucial role in model-based reinforcement learning (RL) by providing predictive representations of an agent in an environment and enabling the agent to reason about the future and make more informed decisions. However, there are still two main problems limiting the applications of world models. First, current methods typically train the world models using only massive domain-specific data, making it challenging to generalize to unseen scenarios or adapt to changes in the environments. Second, it is difficult to define the actions when world models are trained using in the wild videos. In this work, we tackle these two problems by learning a general purpose world model from a diverse and large scale real world video dataset with extracted latent actions. Specifically, our approach leverages a pre-trained vision encoder to project the images of two adjacent frames into states; then, extracts the latent actions into a low dimensional space based on vector quantization; finally, a dynamic function is learned using latent actions. Results show that the proposed generic world model can successfully extract latent actions of arbitrary neighboring frames when testing on in the wild video dataset. Furthermore, fine-tuning on only a small amount of in-domain data can significantly improve the accuracy of the generic world model when adapting to unseen environments.

Chat is not available.