A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). In particular, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.
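To make the scoring idea concrete, below is a minimal PyTorch sketch of a language-guided frame scorer in the spirit of the abstract. It is not the authors' implementation: the class name `LanguageGuidedScorer`, the layer sizes, and the fusion scheme (cross-attention of frames over the text, followed by frame self-attention and a linear scoring head) are illustrative assumptions, and frame and text features are assumed to be precomputed by a CLIP-style encoder.

```python
# A minimal sketch of the language-guided scoring idea, NOT the authors'
# released code. The class name, layer sizes, and fusion order are
# illustrative assumptions; frame and text embeddings are assumed to be
# precomputed by a CLIP-style encoder.
import torch
import torch.nn as nn


class LanguageGuidedScorer(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        # Frames attend to the language embedding: a user query for
        # query-focused summarization, or a generated dense caption
        # for generic summarization.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Frames then attend to one another, modeling importance
        # "relative to one another" as described in the abstract.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.frame_encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, dim) per-frame embeddings
        # text_feats:  (B, L, dim) token embeddings of the query/caption
        fused, _ = self.cross_attn(frame_feats, text_feats, text_feats)
        encoded = self.frame_encoder(fused)
        return self.score_head(encoded).squeeze(-1)  # (B, T) frame scores


# Example: score 120 frames of one video against an 8-token query.
scorer = LanguageGuidedScorer()
scores = scorer(torch.randn(1, 120, 512), torch.randn(1, 8, 512))
```

A summary could then be assembled by selecting the top-scoring frames or shots under a length budget; the knapsack-style shot selection commonly used on benchmarks such as TVSum and SumMe is omitted here for brevity.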
Author Information
Medhini Narasimhan (UC Berkeley)
CS graduate student at the University of Illinois Urbana-Champaign, pursuing research in computer vision and deep learning.
Anna Rohrbach (UC Berkeley)
Trevor Darrell (UC Berkeley, Electrical Engineering & Computer Science Department)
More from the Same Authors
- 2021: Benchmark for Compositional Text-to-Image Synthesis
  Dong Huk Park · Samaneh Azadi · Xihui Liu · Trevor Darrell · Anna Rohrbach
- 2023 Poster: Hierarchical Open-vocabulary Universal Image Segmentation
  Xudong Wang · Shufan Li · Konstantinos Kallidromitis · Yusuke Kato · Kazuki Kozuka · Trevor Darrell
- 2023 Poster: Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
  Grace Luo · Lisa Dunlap · Dong Huk Park · Aleksander Holynski · Trevor Darrell
- 2023 Poster: Diversify Your Vision Datasets with Automatic Diffusion-based Augmentation
  Lisa Dunlap · Alyssa Umino · Han Zhang · Jiezhi Yang · Joseph Gonzalez · Trevor Darrell
- 2023 Poster: Language Models are Visual Reasoning Coordinators
  Liangyu Chen · Bo Li · Sheng Shen · Jingkang Yang · Chunyuan Li · Kurt Keutzer · Trevor Darrell · Ziwei Liu
- 2022 Poster: K-LITE: Learning Transferable Visual Models with External Knowledge
  Sheng Shen · Chunyuan Li · Xiaowei Hu · Yujia Xie · Jianwei Yang · Pengchuan Zhang · Zhe Gan · Lijuan Wang · Lu Yuan · Ce Liu · Kurt Keutzer · Trevor Darrell · Anna Rohrbach · Jianfeng Gao
- 2022 Poster: Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
  Elad Ben Avraham · Roei Herzig · Karttikeya Mangalam · Amir Bar · Anna Rohrbach · Leonid Karlinsky · Trevor Darrell · Amir Globerson
- 2022 Poster: Visual Prompting via Image Inpainting
  Amir Bar · Yossi Gandelsman · Trevor Darrell · Amir Globerson · Alexei Efros
- 2021 Poster: Multi-Person 3D Motion Prediction with Multi-Range Transformers
  Jiashun Wang · Huazhe Xu · Medhini Narasimhan · Xiaolong Wang
- 2021 Poster: Early Convolutions Help Transformers See Better
  Tete Xiao · Mannat Singh · Eric Mintun · Trevor Darrell · Piotr Dollar · Ross Girshick
- 2021 Poster: Teachable Reinforcement Learning via Advice Distillation
  Olivia Watkins · Abhishek Gupta · Trevor Darrell · Pieter Abbeel · Jacob Andreas
- 2018 Poster: Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
  Medhini Narasimhan · Svetlana Lazebnik · Alex Schwing
- 2018 Poster: Speaker-Follower Models for Vision-and-Language Navigation
  Daniel Fried · Ronghang Hu · Volkan Cirik · Anna Rohrbach · Jacob Andreas · Louis-Philippe Morency · Taylor Berg-Kirkpatrick · Kate Saenko · Dan Klein · Trevor Darrell