For the majority of the machine learning community, the cost of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks. We propose a simple zero-shot approach for one such task, Video Moment Retrieval (VMR), that performs no additional finetuning and simply repurposes off-the-shelf models trained on other tasks. Our three-step approach consists of moment proposal, moment-query matching, and postprocessing, all using only off-the-shelf models. On the QVHighlights benchmark for VMR, we improve on the performance of previous zero-shot approaches by at least 2.5x on all metrics and reduce the gap between zero-shot and state-of-the-art supervised performance by over 74%. Further, we show that our zero-shot approach beats non-pretrained supervised models on the Recall metrics and comes very close on the mAP metrics, and that it performs better than the best pretrained supervised model on shorter moments. Finally, we ablate and analyze our results and propose future directions.
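The abstract names three steps (moment proposal, moment-query matching, postprocessing) without detailing them. As a rough illustration only, the sketch below shows one way such a pipeline could be wired together. Everything in it is an assumption rather than the authors' method: sliding windows stand in for proposal, mean cosine similarity over precomputed clip embeddings stands in for matching, and temporal non-maximum suppression stands in for postprocessing. All function names are hypothetical.

```python
import math

def propose_moments(num_clips, window_sizes=(2, 4)):
    """Step 1 (assumed): sliding windows over clip indices, as (start, end) with end exclusive."""
    return [(s, s + w) for w in window_sizes for s in range(num_clips - w + 1)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_moments(clip_embs, query_emb, moments):
    """Step 2 (assumed): score a moment by the mean query-clip similarity inside it."""
    sims = [cosine(clip, query_emb) for clip in clip_embs]
    return [sum(sims[s:e]) / (e - s) for s, e in moments]

def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union

def nms(moments, scores, iou_threshold=0.5):
    """Step 3 (assumed): keep the best-scoring moments, dropping heavily overlapping ones."""
    order = sorted(range(len(moments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(moments[i], moments[j]) < iou_threshold for j in kept):
            kept.append(i)
    return [(moments[i], scores[i]) for i in kept]

def retrieve(clip_embs, query_emb):
    """Run the three assumed steps and return ranked (moment, score) pairs."""
    moments = propose_moments(len(clip_embs))
    scores = score_moments(clip_embs, query_emb, moments)
    return nms(moments, scores)
```

In this sketch the embeddings would come from some off-the-shelf video-text model; which model, and how clips are defined, is left open here because the abstract does not say.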
Author Information
Anuj Diwan (University of Texas at Austin)
Puyuan Peng (University of Texas at Austin)
Raymond Mooney (University of Texas at Austin)
More from the Same Authors
- 2022: Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks
  Albert Yu · Raymond Mooney
- 2022: Language-guided Task Adaptation for Imitation Learning
  Prasoon Goyal · Raymond Mooney · Scott Niekum
- 2019 Poster: Self-Critical Reasoning for Robust Visual Question Answering
  Jialin Wu · Raymond Mooney
- 2019 Spotlight: Self-Critical Reasoning for Robust Visual Question Answering
  Jialin Wu · Raymond Mooney
- 2018: Learning to Understand Natural Language Instructions through Human-Robot Dialog
  Raymond Mooney
- 2017: Panel Discussion
  Felix Hill · Olivier Pietquin · Jack Gallant · Raymond Mooney · Sanja Fidler · Chen Yu · Devi Parikh
- 2017: Visually Grounded Language: Past, Present, and Future...
  Raymond Mooney
- 2015: Generating Natural-Language Video Descriptions using LSTM Recurrent Neural Networks
  Raymond Mooney
- 2011 Workshop: Integrating Language and Vision
  Raymond Mooney · Trevor Darrell · Kate Saenko