Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training rarely explored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging due to the difficulty of modeling long-range relationships and the heavy computational burden caused by more frames. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To effectively capture the rich temporal dynamics and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. We first propose a Multimodal Temporal Contrastive (MTC) loss to learn the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism to effectively capture long-range dependencies while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks of paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performance. Specifically, our model achieves a 16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task and 2.4% on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
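The abstract does not spell out the MTC objective, so the following is only a minimal sketch of a fine-grained clip-to-sentence temporal contrastive loss under assumed inputs: paired sequences of clip and sentence embeddings where temporally aligned pairs are treated as positives. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def multimodal_temporal_contrastive_loss(clip_emb, sent_emb, temperature=0.07):
    """Sketch of a fine-grained temporal contrastive objective.

    clip_emb: (T, D) embeddings of T consecutive video clips
    sent_emb: (T, D) embeddings of the T sentences of the paired paragraph
    Temporally aligned clip/sentence pairs are positives; all other pairs
    within the sequence serve as negatives.
    """
    clip_emb = F.normalize(clip_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    logits = clip_emb @ sent_emb.t() / temperature        # (T, T) similarity matrix
    targets = torch.arange(clip_emb.size(0), device=clip_emb.device)
    # Symmetric InfoNCE: align clips to sentences and sentences to clips.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A usage example would pass per-clip features from the video encoder and per-sentence features from the text encoder for one video-paragraph pair; the actual LF-VILA loss may differ in how positives and negatives are defined across the batch.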
Author Information
Yuchong Sun (Renmin University of China)
Hongwei Xue (University of Science and Technology of China)
Ruihua Song (Renmin University of China)
Bei Liu (Microsoft Research Asia)
Huan Yang (Microsoft Research)
Jianlong Fu (Microsoft Research)
More from the Same Authors
- 2021 Poster: Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
  Yanhong Zeng · Huan Yang · Hongyang Chao · Jianbo Wang · Jianlong Fu
- 2021 Poster: Searching the Search Space of Vision Transformer
  Minghao Chen · Kan Wu · Bolin Ni · Houwen Peng · Bei Liu · Jianlong Fu · Hongyang Chao · Haibin Ling
- 2021 Poster: Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training
  Hongwei Xue · Yupan Huang · Bei Liu · Houwen Peng · Jianlong Fu · Houqiang Li · Jiebo Luo
- 2020 Poster: Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search
  Houwen Peng · Hao Du · Hongyuan Yu · Qi Li · Jing Liao · Jianlong Fu
- 2020 Poster: Learning Semantic-aware Normalization for Generative Adversarial Networks
  Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha
- 2020 Spotlight: Learning Semantic-aware Normalization for Generative Adversarial Networks
  Heliang Zheng · Jianlong Fu · Yanhong Zeng · Jiebo Luo · Zheng-Jun Zha
- 2019 Poster: Learning Deep Bilinear Transformation for Fine-grained Image Representation
  Heliang Zheng · Jianlong Fu · Zheng-Jun Zha · Jiebo Luo