Keynote Talk 6: Dr. Mohamed H. Elhoseiny
Abstract
Towards Imaginative Perception: A Decade+ Journey Towards Human-Level Imaginative AI Skills Transforming Species Discovery and Ecology
The rise of Vision-Language Models (VLMs) has opened new frontiers in video understanding, yet scaling these systems to handle long, complex, and structured visual narratives remains a fundamental challenge. In this talk, we explore a suite of recent advances aimed at scalable and structured video comprehension with Video Large Language Models (Video LLMs). We begin by examining instructable models such as MiniGPT-4 (image-based), MiniGPT-v2 (image-based), MiniGPT-3D, and their video extension, which leverage multimodal instruction tuning for diverse video tasks. We then introduce Long Video LLMs, including the Goldfish and LongVU models, which tackle the token-explosion problem through retrieval-augmented generation and spatiotemporal compression. Next, we address structured understanding with StoryGPT-V and Vgent, which model character consistency and entity-based graph reasoning. To rigorously evaluate progress, we present InfiniBench and SpookyBench, two novel benchmarks designed to probe long-form comprehension and temporal perception in state-of-the-art models. Finally, we extend the discussion to multimodal capabilities in multilingual, emotional, and action-driven contexts, as well as exploratory work on bridging vision and brain signals.