

Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Kumar Ashutosh · Santhosh Kumar Ramakrishnan · Triantafyllos Afouras · Kristen Grauman

Great Hall & Hall B1+B2 (level 1) #129
Tue 12 Dec 8:45 a.m. PST — 10:45 a.m. PST


Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or of a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a particular sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, then leveraging this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional video, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
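
The abstract does not spell out how the task graph is built or applied, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes the graph can be approximated by first-order transition probabilities between keysteps, counted from keystep sequences, and that "regularizing" recognition can be approximated by Viterbi decoding over per-segment keystep scores. The function names `mine_task_graph` and `decode_with_graph`, the `smoothing` parameter, and the toy data are all hypothetical.

```python
import math


def mine_task_graph(sequences, num_keysteps, smoothing=1.0):
    """Estimate P(next keystep | current keystep) from observed sequences.

    Assumption: a first-order transition matrix stands in for the task graph.
    `smoothing` is an additive pseudo-count so unseen transitions stay possible.
    """
    counts = [[smoothing] * num_keysteps for _ in range(num_keysteps)]
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1.0
    # Normalize each row into a probability distribution.
    return [[c / sum(row) for c in row] for row in counts]


def decode_with_graph(scores, graph):
    """Viterbi decoding of a keystep sequence.

    scores[t][k] is a (hypothetical) per-segment score for keystep k at
    segment t, treated as a probability; graph[j][k] is the mined transition
    probability from keystep j to keystep k.
    """
    num_keysteps = len(graph)

    def log(x):
        return math.log(max(x, 1e-12))  # guard against log(0)

    best = [log(s) for s in scores[0]]
    backpointers = []
    for t in range(1, len(scores)):
        new_best, pointers = [], []
        for k in range(num_keysteps):
            # Best predecessor for keystep k under the transition prior.
            prev = max(range(num_keysteps),
                       key=lambda j: best[j] + log(graph[j][k]))
            new_best.append(best[prev] + log(graph[prev][k]) + log(scores[t][k]))
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Trace the highest-scoring path backwards.
    path = [max(range(num_keysteps), key=lambda k: best[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data: three videos that all perform keysteps 0 -> 1 -> 2.
graph = mine_task_graph([[0, 1, 2], [0, 1, 2], [0, 1, 2]], num_keysteps=3)

# Per-segment scores for a new video; the middle segment is ambiguous.
scores = [[0.8, 0.1, 0.1], [0.3, 0.3, 0.4], [0.1, 0.1, 0.8]]
framewise = [max(range(3), key=lambda k: s[k]) for s in scores]
path = decode_with_graph(scores, graph)
```

In this toy case, picking the top-scoring keystep per segment (`framewise`) labels the ambiguous middle segment as keystep 2, while decoding with the mined transitions (`path`) recovers the order 0, 1, 2 seen in the training sequences. This shows how a probabilistic prior can correct a weak per-segment prediction without forcing every video to follow one fixed script.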
