

Poster

HourVideo: 1-Hour Video-Language Understanding

Keshigeyan Chandrasegaran · Agrim Gupta · Manling Li · Taran Kota · Lea M. Hadzic · Jimming He · Cristobal Eyzaguirre · Zane Durante · Jiajun Wu · Fei-Fei Li

West Ballroom A-D #5109
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. The benchmark includes 500 egocentric videos from the Ego4D dataset, spanning durations from 20 to 120 minutes, and features 13,000 high-quality five-way multiple-choice questions. Initial benchmarking results show that multimodal models like GPT-4V and LLaVA-NeXT perform only marginally above random chance. In contrast, human baselines significantly outperform the state-of-the-art long-context multimodal model Gemini Pro 1.5 (84% vs. 40%), suggesting a substantial research gap. Our benchmark, evaluation toolkit, baseline results, prompts, and documentation are included in the Supplementary materials and will be made publicly available.
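The reported numbers correspond to accuracy on five-way multiple-choice questions, for which random chance is 20%. The sketch below shows how such accuracy might be computed over a predictions file; the file layout and field names ("qid", "answer", "prediction") are assumptions for illustration, not the released HourVideo evaluation toolkit.

```python
import json

def mcq_accuracy(path: str) -> float:
    """Compute five-way multiple-choice accuracy over a JSONL predictions file.

    Each line is assumed to hold a JSON object with a gold option letter
    ("answer") and the model's chosen option letter ("prediction").
    This is an illustrative sketch, not the official evaluation code.
    """
    correct = 0
    total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            if record["prediction"].strip().upper() == record["answer"].strip().upper():
                correct += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    acc = mcq_accuracy("predictions.jsonl")  # hypothetical predictions file
    # Random chance on five-way MCQs is 20%; the abstract reports ~40% for
    # Gemini Pro 1.5 and ~84% for human annotators.
    print(f"Accuracy: {acc:.1%}")
```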
