

Spotlight Poster

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Kevin Qinghong Lin · Linjie Li · Difei Gao · Qinchen WU · Mingyi Yan · Zhengyuan Yang · Lijuan Wang · Mike Zheng Shou

East Exhibit Hall A-C #3103
[ Project Page ]
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." However, the derived methods often struggle with complex, visually intensive software tasks in the real world, such as "recreating a specific animation effect shown in a video." The challenges include visual perception, lengthy procedural planning, and executing multiple actions. Recognizing that humans frequently rely on instructional videos to master complex skills, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants across multiple dimensions of advanced GUI tasks. Sourced from high-quality web instructional videos, VideoGUI focuses on advanced tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). Moreover, VideoGUI evaluates GUI assistants through a hierarchical process, allowing identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on the visual state (i.e., a screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. We evaluate representative large multimodal models on VideoGUI, revealing each model's capabilities at these different levels. We observe that the current best model, GPT-4o, while proficient at planning from textual queries, still struggles with planning from visual previews and with executing certain actions such as dragging. These gaps point to directions for developing stronger models or agent systems for GUI automation from instructional videos.
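As an illustration of the per-action signals mentioned for the atomic action execution level, below is a minimal sketch, not the authors' released evaluation code, of how success rates per action type (clicking, dragging, typing, scrolling) could be aggregated. The (action_type, success) record format is a hypothetical simplification of whatever evaluation harness VideoGUI actually uses.

    # Minimal sketch: aggregate per-action-type success rates.
    # Assumes records are (action_type, success) pairs, e.g. ("click", True);
    # this record format is hypothetical, not the benchmark's actual schema.
    from collections import defaultdict

    def per_action_success_rate(records):
        totals = defaultdict(int)  # attempts per action type
        hits = defaultdict(int)    # successful attempts per action type
        for action_type, success in records:
            totals[action_type] += 1
            hits[action_type] += int(success)
        return {a: hits[a] / totals[a] for a in totals}

    # Toy usage covering the action types named in the abstract.
    records = [("click", True), ("click", False), ("drag", False),
               ("type", True), ("scroll", True)]
    print(per_action_success_rate(records))
    # -> {'click': 0.5, 'drag': 0.0, 'type': 1.0, 'scroll': 1.0}

Reporting each action type separately, rather than a single pooled accuracy, is what lets the benchmark localize failures such as a model handling clicks well but struggling with drags.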
