Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
Reuben Tan · Bryan Plummer · Kate Saenko · Hailin Jin · Bryan Russell

Wed Dec 08 04:30 PM -- 06:00 PM (PST)

We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.
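The divided attention strategy and contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the absence of learned projections, the residual connections, the inter-then-intra ordering within a layer, the mean pooling, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys):
    # Scaled dot-product attention; keys double as values (assumption: no learned projections).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ keys

def divided_attention_layer(visual, words):
    # Inter-modal step: each modality attends to the other.
    v = visual + attend(visual, words)
    w = words + attend(words, visual)
    # Intra-modal step: each modality attends to itself.
    v = v + attend(v, v)
    w = w + attend(w, w)
    return v, w

def pair_similarity(visual, words, num_layers=2):
    # Stack several layers, mean-pool each modality, and compare with cosine similarity.
    v, w = visual, words
    for _ in range(num_layers):
        v, w = divided_attention_layer(v, w)
    v_vec, w_vec = v.mean(axis=0), w.mean(axis=0)
    return float(v_vec @ w_vec / (np.linalg.norm(v_vec) * np.linalg.norm(w_vec)))

def similarity_matrix(videos, narrations, num_layers=2):
    # Cross-modal attention makes features pair-dependent, so every pair is scored separately.
    return np.array([[pair_similarity(v, n, num_layers) for n in narrations] for v in videos])

def contrastive_loss(sim, temperature=0.1):
    # Symmetric InfoNCE-style loss: matching video/narration pairs lie on the diagonal.
    logits = sim / temperature
    def nll(l):
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))
    return 0.5 * (nll(logits) + nll(logits.T))
```

Here `visual` stands for an array of region features with shape (regions, dim) and `words` for an array of word features with shape (words, dim). In this sketch the loss is lowest when each video scores highest against its own narration.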

Author Information

Reuben Tan (Boston University)
Bryan Plummer (Boston University)
Kate Saenko (Boston University & MIT-IBM Watson AI Lab, IBM Research)
Kate Saenko

Kate is an AI Research Scientist at FAIR, Meta and a Full Professor of Computer Science at Boston University (currently on leave), where she leads the Computer Vision and Learning Group. Kate received a PhD in EECS from MIT and did postdoctoral training at UC Berkeley and Harvard. Her research interests are in Artificial Intelligence with a focus on out-of-distribution learning, dataset bias, domain adaptation, vision and language understanding, and other topics in deep learning. Past academic positions: Consulting Professor at the MIT-IBM Watson AI Lab, 2019-2022; Assistant Professor, Computer Science Department, UMass Lowell; Postdoctoral Researcher, International Computer Science Institute; Visiting Scholar, UC Berkeley EECS; Visiting Postdoctoral Fellow, SEAS, Harvard University.

Hailin Jin (Adobe)
Bryan Russell (Intel Labs)
