Spatio-temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression. Existing approaches mainly treat this complicated task as a parallel frame-grounding problem and thus suffer from two types of inconsistency: feature alignment inconsistency and prediction inconsistency. In this paper, we present an end-to-end one-stage framework, termed Spatio-Temporal Consistency-Aware Transformer (STCAT), to alleviate these issues. Specifically, we introduce a novel multi-modal template as the global objective to address this task, which explicitly constrains the grounding region and associates the predictions among all video frames. Moreover, to generate this template under sufficient video-textual perception, an encoder-decoder architecture is proposed for effective global context modeling. Thanks to these critical designs, STCAT enjoys more consistent cross-modal feature alignment and tube prediction without relying on any pre-trained object detectors. Extensive experiments show that our method outperforms previous state-of-the-art methods by clear margins on two challenging video benchmarks (VidSTG and HC-STVG), illustrating the superiority of the proposed framework in better understanding the association between vision and natural language. Code is publicly available at https://github.com/jy0205/STCAT.
Author Information
Yang Jin (Peking University)
Yongzhi Li (ByteDance)
Zehuan Yuan (Nanjing University)
Yadong Mu (Peking University)
Related Events (a corresponding poster, oral, or spotlight)
- 2022 Poster: Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding
More from the Same Authors
- 2022 Poster: QueryPose: Sparse Multi-Person Pose Regression via Spatial-Aware Part-Level Query
  Yabo Xiao · Kai Su · Xiaojuan Wang · Dongdong Yu · Lei Jin · Mingshu He · Zehuan Yuan
- 2022 Spotlight: Lightning Talks 3A-4
  Jinzhi Zhang · Hao Jiang · Hongrui Cai · Qi Yi · Yang Jin · Zhi Tian · Rui Zhang · Wanquan Feng · Xiangxiang Chu · Ruofan Tang · Yongzhi Li · Yadong Mu · Zehuan Yuan · Shaohui Peng · Zheng Cao · Xiaoming Wang · Xuetao Feng · Xiaolin Wei · Jiaming Guo · Yadong Mu · Yan Wang · Jing Xiao · Xing Hu · Chunhua Shen · Ruqi Huang · Juyong Zhang · Zidong Du · Lu Fang · Xishan Zhang · Qi Guo · Yunji Chen
- 2022 Spotlight: Conditional Diffusion Process for Inverse Halftoning
  Hao Jiang · Yadong Mu
- 2022 Poster: Conditional Diffusion Process for Inverse Halftoning
  Hao Jiang · Yadong Mu
- 2022 Poster: Rethinking Resolution in the Context of Efficient Video Recognition
  Chuofan Ma · Qiushan Guo · Yi Jiang · Ping Luo · Zehuan Yuan · Xiaojuan Qi
- 2021 Poster: Disentangled Contrastive Learning on Graphs
  Haoyang Li · Xin Wang · Ziwei Zhang · Zehuan Yuan · Hang Li · Wenwu Zhu