We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training. We introduce a divided strategy that alternates between computing inter- and intra-modal attention across the visual and natural language modalities, which allows effective training via directly contrasting the two modalities' representations. We demonstrate the effectiveness of our approach by self-training on the HowTo100M instructional video dataset and evaluating on a newly collected dataset of localized described interactions in the YouCook2 dataset. We show that our approach outperforms alternative baselines, including shallow co-attention and full cross-modal attention. We also apply our approach to grounding phrases in images with weak supervision on Flickr30K and show that stacking multiple attention layers is effective and, when combined with a word-to-region loss, achieves state of the art on recall-at-one and pointing hand accuracies.
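The alternation between inter- and intra-modal attention, followed by a contrastive objective, can be illustrated with a small NumPy sketch. This is not the authors' architecture. It assumes single-head attention without learned projections, an inter-then-intra ordering within each layer, mean pooling into one vector per modality, and a symmetric InfoNCE-style loss. All function names and the `temperature` and `num_layers` parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention with a residual connection.
    # Assumption: no learned projection matrices, to keep the sketch short.
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return queries + weights @ keys_values

def divided_attention_layer(regions, words):
    # Inter-modal step: each modality attends to the other one.
    regions_x = attend(regions, words)
    words_x = attend(words, regions)
    # Intra-modal step: each modality attends to itself.
    return attend(regions_x, regions_x), attend(words_x, words_x)

def encode_pair(regions, words, num_layers=2):
    # Stack several divided layers, then mean-pool and L2-normalize.
    for _ in range(num_layers):
        regions, words = divided_attention_layer(regions, words)
    v, t = regions.mean(axis=0), words.mean(axis=0)
    return v / np.linalg.norm(v), t / np.linalg.norm(t)

def log_softmax(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def contrastive_loss(video_batch, text_batch, temperature=0.1):
    # Score every (video, narration) pair; matching pairs share an index.
    n = len(video_batch)
    scores = np.empty((n, n))
    for i, regions in enumerate(video_batch):
        for j, words in enumerate(text_batch):
            v, t = encode_pair(regions, words)
            scores[i, j] = (v @ t) / temperature
    # Symmetric cross-entropy: video-to-text (rows) and text-to-video (columns).
    v2t = -np.diag(log_softmax(scores, axis=1)).mean()
    t2v = -np.diag(log_softmax(scores, axis=0)).mean()
    return 0.5 * (v2t + t2v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three toy clips (5 region features each) and narrations (4 word features each).
    videos = [rng.normal(size=(5, 8)) for _ in range(3)]
    texts = [rng.normal(size=(4, 8)) for _ in range(3)]
    print("loss:", contrastive_loss(videos, texts))
```

Because the cross-modal layers mix the two inputs, the sketch re-encodes every video-narration pair before scoring it, which costs a quadratic number of forward passes per batch. A real system would likely use learned projections and multiple heads, and might score pairs differently.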
Author Information
Reuben Tan (Boston University)
Bryan Plummer (Boston University)
Kate Saenko (Boston University & MIT-IBM Watson AI Lab, IBM Research)

Kate is an AI Research Scientist at FAIR, Meta and a Full Professor of Computer Science at Boston University (currently on leave), where she leads the Computer Vision and Learning Group. Kate received a PhD in EECS from MIT and did postdoctoral training at UC Berkeley and Harvard. Her research interests are in Artificial Intelligence with a focus on out-of-distribution learning, dataset bias, domain adaptation, vision and language understanding, and other topics in deep learning.

Past academic positions:
- Consulting Professor at the MIT-IBM Watson AI Lab, 2019-2022
- Assistant Professor, Computer Science Department at UMass Lowell
- Postdoctoral Researcher, International Computer Science Institute
- Visiting Scholar, UC Berkeley EECS
- Visiting Postdoctoral Fellow, SEAS, Harvard University
Hailin Jin (Adobe)
Bryan Russell (Intel Labs)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Poster: Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos
  Thu. Dec 9th 12:30 -- 02:00 AM
More from the Same Authors
- 2021 : Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation
  Aadarsh Sahoo · Rameswar Panda · Rogerio Feris · Kate Saenko · Abir Das
- 2021 : Extending the WILDS Benchmark for Unsupervised Adaptation
  Shiori Sagawa · Pang Wei Koh · Tony Lee · Irena Gao · Sang Michael Xie · Kendrick Shen · Ananya Kumar · Weihua Hu · Michihiro Yasunaga · Henrik Marklund · Sara Beery · Ian Stavness · Jure Leskovec · Kate Saenko · Tatsunori Hashimoto · Sergey Levine · Chelsea Finn · Percy Liang
- 2021 : Surprisingly Simple Semi-Supervised Domain Adaptation with Pretraining and Consistency
  Samarth Mishra · Kate Saenko · Venkatesh Saligrama
- 2022 : Fifteen-minute Competition Overview Video
  Kate Saenko · Samarth Mishra · Dina Bashkirova · Vitaly Ablavsky · Sarah Bargal · Rachel Lai · Piotr Teterwak · James Akl · Fadi Alladkani · Donghyun Kim · Berk Calli
- 2023 Poster: Cola: A Benchmark for Compositional Text-to-image Retrieval
  Arijit Ray · Filip Radenovic · Abhimanyu Dubey · Bryan Plummer · Ranjay Krishna · Kate Saenko
- 2022 Competition: VisDA 2022 Challenge: Sim2Real Domain Adaptation for Industrial Recycling
  Dina Bashkirova · Samarth Mishra · Piotr Teterwak · Donghyun Kim · Rachel Lai · Fadi Alladkani · James Akl · Vitaly Ablavsky · Sarah Bargal · Berk Calli · Kate Saenko
- 2022 : Challenge Introduction
  Dina Bashkirova · Samarth Mishra · Piotr Teterwak · Donghyun Kim · Sarah Bargal · Diala Lteif · Kate Saenko
- 2022 : Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
  Vitali Petsiuk · Alexander E. Siemenn · Saisamrit Surbehera · Qi Qi Chin · Keith Tyser · Gregory Hunter · Arvind Raghavan · Yann Hicke · Bryan Plummer · Ori Kerret · Tonio Buonassisi · Kate Saenko · Armando Solar-Lezama · Iddo Drori
- 2022 Poster: DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations
  Ximeng Sun · Ping Hu · Kate Saenko
- 2022 Poster: Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing
  Nataniel Ruiz · Sarah Bargal · Cihang Xie · Kate Saenko · Stan Sclaroff
- 2022 Poster: How Transferable are Video Representations Based on Synthetic Data?
  Yo-whan Kim · Samarth Mishra · SouYoung Jin · Rameswar Panda · Hilde Kuehne · Leonid Karlinsky · Venkatesh Saligrama · Kate Saenko · Aude Oliva · Rogerio Feris
- 2022 Poster: FETA: Towards Specializing Foundational Models for Expert Task Applications
  Amit Alfassy · Assaf Arbelle · Oshri Halimi · Sivan Harary · Roei Herzig · Eli Schwartz · Rameswar Panda · Michele Dolfi · Christoph Auer · Peter Staar · Kate Saenko · Rogerio Feris · Leonid Karlinsky
- 2021 Workshop: Distribution shifts: connecting methods and applications (DistShift)
  Shiori Sagawa · Pang Wei Koh · Fanny Yang · Hongseok Namkoong · Jiashi Feng · Kate Saenko · Percy Liang · Sarah Bird · Sergey Levine
- 2021 Poster: OpenMatch: Open-Set Semi-supervised Learning with Open-set Consistency Regularization
  Kuniaki Saito · Donghyun Kim · Kate Saenko
- 2021 Poster: A Multi-Implicit Neural Representation for Fonts
  Pradyumna Reddy · Zhifei Zhang · Zhaowen Wang · Matthew Fisher · Hailin Jin · Niloy Mitra
- 2021 : VisDA21: Visual Domain Adaptation + Q&A
  Kate Saenko · Kuniaki Saito · Donghyun Kim · Samarth Mishra · Ben Usman · Piotr Teterwak · Dina Bashkirova · Dan Hendrycks
- 2021 Poster: Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing
  Aadarsh Sahoo · Rutav Shah · Rameswar Panda · Kate Saenko · Abir Das
- 2020 Poster: Log-Likelihood Ratio Minimizing Flows: Towards Robust and Quantifiable Neural Distribution Alignment
  Ben Usman · Avneesh Sud · Nick Dufour · Kate Saenko
- 2020 Poster: Uncertainty-Aware Learning for Zero-Shot Semantic Segmentation
  Ping Hu · Stan Sclaroff · Kate Saenko
- 2020 Poster: Geo-PIFu: Geometry and Pixel Aligned Implicit Functions for Single-view Human Reconstruction
  Tong He · John Collomosse · Hailin Jin · Stefano Soatto
- 2020 Poster: Universal Domain Adaptation through Self Supervision
  Kuniaki Saito · Donghyun Kim · Stan Sclaroff · Kate Saenko
- 2020 Poster: Auxiliary Task Reweighting for Minimum-data Learning
  Baifeng Shi · Judy Hoffman · Kate Saenko · Trevor Darrell · Huijuan Xu
- 2020 Poster: AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning
  Ximeng Sun · Rameswar Panda · Rogerio Feris · Kate Saenko
- 2019 Poster: Adversarial Self-Defense for Cycle-Consistent GANs
  Dina Bashkirova · Ben Usman · Kate Saenko
- 2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation
  Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell
- 2016 : Invited Talk: Domain Adaption for Perception and Action (Kate Saenko, Boston University)
  Kate Saenko
- 2015 Workshop: Transfer and Multi-Task Learning: Trends and New Perspectives
  Anastasia Pentina · Christoph Lampert · Sinno Jialin Pan · Mingsheng Long · Judy Hoffman · Baochen Sun · Kate Saenko