Foundation models pretrained on diverse vision and language datasets have demonstrated exceptional capabilities across a wide range of downstream vision and language tasks. As foundation models are deployed in real-world applications such as dialogue, autonomous driving, healthcare, and robotics, they inevitably face new challenges: learning from external feedback, adapting to different task modalities, and performing long-term reasoning and planning. Such challenges have traditionally been at the core of sequential decision making, encompassing areas such as reinforcement learning, imitation learning, planning, search, and optimal control. These research fields have typically focused on task-specific settings with limited prior knowledge, yet they have made significant progress in surpassing human performance on tasks such as playing board games and Atari video games, as well as operating robots to complete navigation and manipulation tasks. However, because these methods generally learn to solve a specific task from scratch, without broad knowledge from vision and language, they can struggle with generalization and sample efficiency.

The goal of this workshop is to bring the sequential decision making community, including planning, search, RL, and optimal control, together with the vision and language foundation models community to confront the challenges of decision making at scale. The workshop will span high-level discussions on how foundation models and decision making can benefit each other when jointly considered, as well as low-level algorithmic details of various decision making algorithms and vision-language architectures, which may present both opportunities and challenges. Specific topics will include, for example, foundation model agents interacting with humans, computers, tools, simulators, the physical world, and each other.