Detecting customized moments and highlights from videos given natural language (NL) user queries is an important but under-studied topic. One of the challenges in pursuing this direction is the lack of annotated data. To address this issue, we present the Query-based Video Highlights (QVHighlights) dataset. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips. This comprehensive annotation enables us to develop and evaluate systems that detect relevant moments as well as salient highlights for diverse, flexible user queries. We also present a strong baseline for this task, Moment-DETR, a transformer encoder-decoder model that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end. While our model does not utilize any human prior, we show that it performs competitively when compared to well-engineered architectures. With weakly supervised pretraining using ASR captions, Moment-DETR substantially outperforms previous methods. Lastly, we present several ablations and visualizations of Moment-DETR. Data and code are publicly available at https://github.com/jayleicn/moment_detr.
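To make the set-prediction formulation concrete, here is a minimal PyTorch sketch of a Moment-DETR-style model: video clip features and query token features are projected into a shared space, encoded jointly by a transformer encoder, and a fixed set of learned moment queries is decoded into (center, width) span predictions plus per-clip saliency scores. The class name, feature dimensions, number of moment queries, and prediction heads below are illustrative assumptions, not the paper's exact configuration, and the set-prediction training losses (e.g., Hungarian matching between predicted and ground-truth moments, as in DETR) are omitted; the official implementation is in the linked repository.

```python
# Illustrative sketch of a DETR-style moment retrieval model (not the official code).
import torch
import torch.nn as nn

class MomentDETRSketch(nn.Module):
    def __init__(self, vid_dim=2048, txt_dim=512, hidden=256,
                 num_queries=10, nhead=8, enc_layers=2, dec_layers=2):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, hidden)    # project extracted clip features
        self.txt_proj = nn.Linear(txt_dim, hidden)    # project query token features
        self.type_embed = nn.Embedding(2, hidden)     # video vs. text token type
        self.transformer = nn.Transformer(
            d_model=hidden, nhead=nhead,
            num_encoder_layers=enc_layers, num_decoder_layers=dec_layers,
            batch_first=True)
        self.moment_queries = nn.Embedding(num_queries, hidden)  # learned moment slots
        self.span_head = nn.Linear(hidden, 2)         # normalized (center, width)
        self.class_head = nn.Linear(hidden, 2)        # foreground / background
        self.saliency_head = nn.Linear(hidden, 1)     # per-clip saliency score

    def forward(self, vid_feats, txt_feats):
        # vid_feats: (B, Lv, vid_dim) clip features; txt_feats: (B, Lq, txt_dim)
        # Positional encodings are omitted here for brevity.
        B, Lv, _ = vid_feats.shape
        v = self.vid_proj(vid_feats) + self.type_embed.weight[0]
        t = self.txt_proj(txt_feats) + self.type_embed.weight[1]
        src = torch.cat([v, t], dim=1)                # joint video-text input sequence
        tgt = self.moment_queries.weight.unsqueeze(0).expand(B, -1, -1)
        memory = self.transformer.encoder(src)        # encoder output, also used for saliency
        hs = self.transformer.decoder(tgt, memory)    # decoded moment query representations
        spans = self.span_head(hs).sigmoid()          # (B, num_queries, 2)
        logits = self.class_head(hs)                  # (B, num_queries, 2)
        saliency = self.saliency_head(memory[:, :Lv]).squeeze(-1)  # (B, Lv)
        return spans, logits, saliency

if __name__ == "__main__":
    model = MomentDETRSketch()
    spans, logits, saliency = model(torch.randn(2, 75, 2048), torch.randn(2, 20, 512))
    print(spans.shape, logits.shape, saliency.shape)  # (2,10,2) (2,10,2) (2,75)
```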
Author Information
Jie Lei (Department of Computer Science, UNC, Chapel Hill)
Tamara L Berg (Stony Brook University)
Mohit Bansal (UNC Chapel Hill)
More from the Same Authors
- 2021: VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
  Linjie Li · Jie Lei · Zhe Gan · Licheng Yu · Yen-Chun Chen · Rohit Pillai · Yu Cheng · Luowei Zhou · Xin Wang · William Yang Wang · Tamara L Berg · Mohit Bansal · Jingjing Liu · Lijuan Wang · Zicheng Liu
- 2022: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2023 Poster: Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
  Jaemin Cho · Abhay Zala · Mohit Bansal
- 2023 Poster: Resolving Interference When Merging Models
  Prateek Yadav · Derek Tam · Leshem Choshen · Colin Raffel · Mohit Bansal
- 2023 Poster: PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
  Jialu Li · Mohit Bansal
- 2023 Poster: Self-Chained Image-Language Model for Video Localization and Question Answering
  Shoubin Yu · Jaemin Cho · Prateek Yadav · Mohit Bansal
- 2023 Poster: Paxion: Patching Action Knowledge in Video-Language Foundation Models
  Zhenhailong Wang · Ansel Blume · Sha Li · Genglin Liu · Jaemin Cho · Zineng Tang · Mohit Bansal · Heng Ji
- 2023 Poster: Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind
  Swarnadeep Saha · Peter Hase · Mohit Bansal
- 2023 Poster: Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models
  Peter Hase · Mohit Bansal · Been Kim · Asma Ghandeharioun
- 2023 Poster: Adaptive Contextual Perception: How To Generalize To New Backgrounds and Ambiguous Objects
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2023 Poster: Any-to-Any Generation via Composable Diffusion
  Zineng Tang · Ziyi Yang · Chenguang Zhu · Michael Zeng · Mohit Bansal
- 2022 Poster: TVLT: Textless Vision-Language Transformer
  Zineng Tang · Jaemin Cho · Yixin Nie · Mohit Bansal
- 2022 Poster: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
  Zhenhailong Wang · Manling Li · Ruochen Xu · Luowei Zhou · Jie Lei · Xudong Lin · Shuohang Wang · Ziyi Yang · Chenguang Zhu · Derek Hoiem · Shih-Fu Chang · Mohit Bansal · Heng Ji
- 2022 Poster: LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
  Yi-Lin Sung · Jaemin Cho · Mohit Bansal
- 2022 Poster: Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
  Haokun Liu · Derek Tam · Mohammed Muqeeth · Jay Mohta · Tenghao Huang · Mohit Bansal · Colin Raffel
- 2022 Poster: VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
  Zhuofan Ying · Peter Hase · Mohit Bansal
- 2022 Poster: WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
  Yonatan Bitton · Nitzan Bitton Guetta · Ron Yosef · Yuval Elovici · Mohit Bansal · Gabriel Stanovsky · Roy Schwartz
- 2021 Poster: The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations
  Peter Hase · Harry Xie · Mohit Bansal
- 2021 Poster: VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
  Zineng Tang · Jaemin Cho · Hao Tan · Mohit Bansal
- 2020 Workshop: HAMLETS: Human And Model in the Loop Evaluation and Training Strategies
  Divyansh Kaushik · Bhargavi Paranjape · Forough Arabshahi · Yanai Elazar · Yixin Nie · Max Bartolo · Polina Kirichenko · Pontus Lars Erik Saito Stenetorp · Mohit Bansal · Zachary Lipton · Douwe Kiela
- 2017 Demonstration: Interactive-Length Multi-Task Video Captioning with Cooperative Feedback
  Han Guo · Ramakanth Pasunuru · Mohit Bansal
- 2011 Poster: Im2Text: Describing Images Using 1 Million Captioned Photographs
  Vicente Ordonez · Girish Kulkarni · Tamara L Berg
- 2011 Spotlight: Im2Text: Describing Images Using 1 Million Captioned Photographs
  Vicente Ordonez · Girish Kulkarni · Tamara L Berg