
Spotlight Poster

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Peter Shaw · Mandar Joshi · James Cohan · Jonathan Berant · Panupong Pasupat · Hexiang Hu · Urvashi Khandelwal · Kenton Lee · Kristina N Toutanova

Great Hall & Hall B1+B2 (level 1) #304
[ Paper ] [ Poster ] [ OpenReview ]
Tue 12 Dec 3:15 p.m. PST — 5:15 p.m. PST


Much of the previous work on digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots and a generic action space corresponding to keyboard and mouse actions. Building on recent progress in pixel-based pretraining, we show, for the first time, that such agents can outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction-following tasks.
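To make the contrast with custom, task-specific action spaces concrete, here is a minimal sketch of what a generic keyboard-and-mouse action space could look like. All names and fields (`MouseAction`, `KeyAction`, `click`) are hypothetical illustrations, not the paper's actual interface definition.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical sketch: every UI interaction reduces to mouse and keyboard
# primitives over pixel coordinates, independent of any HTML/DOM structure.

@dataclass(frozen=True)
class MouseAction:
    kind: str                # e.g. "click", "double_click", "scroll"
    xy: Tuple[int, int]      # pixel coordinates on the screenshot

@dataclass(frozen=True)
class KeyAction:
    kind: str                # "press" (a named key) or "type" (literal text)
    keys: str                # e.g. "Enter", or the text to type

Action = Union[MouseAction, KeyAction]

def click(x: int, y: int) -> MouseAction:
    """Convenience constructor for a single left click at pixel (x, y)."""
    return MouseAction(kind="click", xy=(x, y))

# An episode is then just a sequence of such actions, predicted from
# screenshots rather than from a structured DOM.
episode: List[Action] = [
    click(120, 48),
    KeyAction(kind="type", keys="hello"),
    KeyAction(kind="press", keys="Enter"),
]
```

Because the same primitives apply to any on-screen task, an agent using this space needs no per-task action engineering, which is the property the abstract highlights.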
