Recent action recognition models have achieved impressive results by integrating objects, their locations, and their interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage it for a video downstream task? We propose a learning framework, StructureViT (SViT for short), which demonstrates how the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual video frames should "align" with those of still images. This is achieved via a Frame-Clip Consistency loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets, including first place in the Ego4D CVPR'22 Point of No Return Temporal Localization Challenge. For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/.
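To make the Hand-Object Graph mentioned in the abstract concrete, here is a minimal illustrative sketch of such a structure: hands and objects with bounding-box locations as nodes, and contact relations as edges. All names and fields below are hypothetical and chosen for illustration; they are not the authors' actual implementation.

```python
# Illustrative sketch of a Hand-Object Graph (hypothetical names/fields,
# not the SViT codebase): nodes are hands and objects with their
# bounding-box locations; edges encode the contact/no-contact relation.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str    # "hand" or "object"
    box: tuple   # (x1, y1, x2, y2) location in the frame

@dataclass
class HandObjectGraph:
    nodes: list = field(default_factory=list)
    contact_edges: list = field(default_factory=list)  # (hand_idx, obj_idx)

    def add_node(self, kind, box):
        self.nodes.append(Node(kind, box))
        return len(self.nodes) - 1  # index of the new node

    def add_contact(self, hand_idx, obj_idx):
        # Presence of an edge marks physical contact between hand and object.
        self.contact_edges.append((hand_idx, obj_idx))

# Example: a single frame with one hand touching a cup.
g = HandObjectGraph()
h = g.add_node("hand", (10, 20, 50, 60))
o = g.add_node("object", (45, 30, 90, 80))
g.add_contact(h, o)
```

In the paper's framework, per-frame graphs like this would supervise the object tokens; the sketch only shows the data being annotated, not the training procedure.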
Author Information
Elad Ben Avraham (Tel Aviv University)
Roei Herzig (Tel Aviv University)
Karttikeya Mangalam (UC Berkeley (BAIR))
I'm a first-year PhD student in Computer Science at the Department of Electrical Engineering & Computer Sciences (EECS) at the University of California, Berkeley, where I'm jointly advised by Prof. Jitendra Malik and Prof. Yi Ma.
Amir Bar (TAU / UC Berkeley)

Amir Bar is a fourth-year Ph.D. candidate at Tel Aviv University and a Visiting Ph.D. Researcher at UC Berkeley, advised by Amir Globerson and Trevor Darrell. His primary research centers on self-supervised learning and on using large amounts of unlabeled images and videos to enable computers to develop visual understanding. Lately, his focus has been on improving learning algorithms for Masked Image Modeling and Visual Prompting, which adapts computer vision models at test time to novel tasks without changing the model weights or task-specific fine-tuning.
Anna Rohrbach (UC Berkeley)
Leonid Karlinsky (Weizmann Institute of Science)
Trevor Darrell (Electrical Engineering & Computer Science Department)
Amir Globerson (Tel Aviv University, Google)
More from the Same Authors
- 2021: Benchmark for Compositional Text-to-Image Synthesis
  Dong Huk Park · Samaneh Azadi · Xihui Liu · Trevor Darrell · Anna Rohrbach
- 2023 Poster: LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
  Muhammad Jehanzeb Mirza · Leonid Karlinsky · Wei Lin · Horst Possegger · Mateusz Kozinski · Rogerio Feris · Horst Bischof
- 2023 Poster: Hierarchical Open-vocabulary Universal Image Segmentation
  Xudong Wang · Shufan Li · Konstantinos Kallidromitis · Yusuke Kato · Kazuki Kozuka · Trevor Darrell
- 2023 Poster: Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
  Grace Luo · Lisa Dunlap · Dong Huk Park · Aleksander Holynski · Trevor Darrell
- 2023 Poster: Big Little Transformer Decoder
  Sehoon Kim · Karttikeya Mangalam · Suhong Moon · John Canny · Jitendra Malik · Michael Mahoney · Amir Gholami · Kurt Keutzer
- 2023 Poster: Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
  Sivan Doveh · Assaf Arbelle · Sivan Harary · Roei Herzig · Donghyun Kim · Paola Cascante-Bonilla · Amit Alfassy · Rameswar Panda · Raja Giryes · Rogerio Feris · Shimon Ullman · Leonid Karlinsky
- 2023 Poster: Diversify Your Vision Datasets with Automatic Diffusion-based Augmentation
  Lisa Dunlap · Alyssa Umino · Han Zhang · Jiezhi Yang · Joseph Gonzalez · Trevor Darrell
- 2023 Poster: Language Models are Visual Reasoning Coordinators
  Liangyu Chen · Bo Li · Sheng Shen · Jingkang Yang · Chunyuan Li · Kurt Keutzer · Trevor Darrell · Ziwei Liu
- 2023 Poster: Learning Human Action Recognition Representations Without Real Humans
  Howard Zhong · Samarth Mishra · Donghyun Kim · SouYoung Jin · Rameswar Panda · Hilde Kuehne · Leonid Karlinsky · Venkatesh Saligrama · Aude Oliva · Rogerio Feris
- 2023 Poster: EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
  Karttikeya Mangalam · Raiymbek Akshulakov · Jitendra Malik
- 2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge
  Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao
- 2022 Poster: Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
  Sehoon Kim · Amir Gholami · Albert Shaw · Nicholas Lee · Karttikeya Mangalam · Jitendra Malik · Michael Mahoney · Kurt Keutzer
- 2022 Poster: Visual Prompting via Image Inpainting
  Amir Bar · Yossi Gandelsman · Trevor Darrell · Amir Globerson · Alexei Efros
- 2022 Poster: How Transferable are Video Representations Based on Synthetic Data?
  Yo-whan Kim · Samarth Mishra · SouYoung Jin · Rameswar Panda · Hilde Kuehne · Leonid Karlinsky · Venkatesh Saligrama · Kate Saenko · Aude Oliva · Rogerio Feris
- 2022 Poster: FETA: Towards Specializing Foundational Models for Expert Task Applications
  Amit Alfassy · Assaf Arbelle · Oshri Halimi · Sivan Harary · Roei Herzig · Eli Schwartz · Rameswar Panda · Michele Dolfi · Christoph Auer · Peter Staar · Kate Saenko · Rogerio Feris · Leonid Karlinsky
- 2021 Poster: A Theoretical Analysis of Fine-tuning with Linear Teachers
  Gal Shachaf · Alon Brutzkus · Amir Globerson
- 2021 Poster: Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data
  Ashraful Islam · Chun-Fu (Richard) Chen · Rameswar Panda · Leonid Karlinsky · Rogerio Feris · Richard J. Radke
- 2021 Poster: CLIP-It! Language-Guided Video Summarization
  Medhini Narasimhan · Anna Rohrbach · Trevor Darrell
- 2021 Poster: Early Convolutions Help Transformers See Better
  Tete Xiao · Mannat Singh · Eric Mintun · Trevor Darrell · Piotr Dollar · Ross Girshick
- 2021 Poster: Teachable Reinforcement Learning via Advice Distillation
  Olivia Watkins · Abhishek Gupta · Trevor Darrell · Pieter Abbeel · Jacob Andreas
- 2018 Poster: Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction
  Roei Herzig · Moshiko Raboh · Gal Chechik · Jonathan Berant · Amir Globerson
- 2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation
  Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell
- 2017 Poster: Robust Conditional Probabilities
  Yoav Wald · Amir Globerson
- 2010 Poster: Using body-anchored priors for identifying actions in single images
  Leonid Karlinsky · Michael Dinerstein · Shimon Ullman