Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods typically model the cross-modal interaction either via the similarity of each modality's global feature, which misses fine-grained information, or via cross/self-attention over visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this talk, we introduce FILIP, a large-scale Fine-grained Interactive Language-Image Pre-training method that achieves finer-grained alignment through a cross-modal late interaction mechanism, which uses the token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. The resulting models, FILIP and Wukong, achieve good performance on multiple downstream vision-language tasks while maintaining the inference efficiency of dual-stream models. Visualization of word-patch alignment further shows that FILIP learns meaningful fine-grained features with promising localization ability. Furthermore, we release Wukong, a dataset of 100 million Chinese image-text pairs for pre-training.
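To make the late interaction concrete, below is a minimal sketch of the token-wise maximum similarity described above, written in PyTorch. The function name `filip_similarity`, the tensor names, and the temperature value of 100 are illustrative assumptions, not the talk's reference implementation; the idea is simply that each image patch keeps only its most similar text token, the patch-level maxima are averaged into one image-to-text score, and that score feeds a standard contrastive loss.

```python
import torch

def filip_similarity(image_tokens, text_tokens, text_mask):
    # image_tokens: (Bi, N, D) L2-normalized visual patch features
    # text_tokens:  (Bt, M, D) L2-normalized textual token features
    # text_mask:    (Bt, M) boolean, True for non-padding text tokens
    # Pairwise token similarities for every image-text pair: (Bi, Bt, N, M)
    sim = torch.einsum('ind,jmd->ijnm', image_tokens, text_tokens)
    # Ignore padded text tokens when taking the maximum
    sim = sim.masked_fill(~text_mask[None, :, None, :], float('-inf'))
    # For each image patch, keep its max similarity over text tokens,
    # then average the patch-level maxima into one image-to-text score.
    i2t = sim.max(dim=-1).values.mean(dim=-1)  # (Bi, Bt)
    return i2t

# Toy usage: 4 image-text pairs, 49 patches, 16 text tokens, 128-dim features
img = torch.nn.functional.normalize(torch.randn(4, 49, 128), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(4, 16, 128), dim=-1)
mask = torch.ones(4, 16, dtype=torch.bool)
logits = filip_similarity(img, txt, mask)  # (4, 4) image-to-text similarity matrix
labels = torch.arange(4)
# Image-to-text contrastive term; 100.0 is an assumed inverse temperature
loss = torch.nn.functional.cross_entropy(logits * 100.0, labels)
```

The symmetric text-to-image score is obtained analogously by taking the maximum over image patches instead, and the two directions together drive the contrastive objective.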
Author Information
Lu Hou (Huawei Technologies Co., Ltd)
More from the Same Authors
- 2022 Poster: Towards Efficient Post-training Quantization of Pre-trained Language Models
  Haoli Bai · Lu Hou · Lifeng Shang · Xin Jiang · Irwin King · Michael R Lyu
- 2022 Poster: Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
  Jiaxi Gu · Xiaojun Meng · Guansong Lu · Lu Hou · Niu Minzhe · Xiaodan Liang · Lewei Yao · Runhui Huang · Wei Zhang · Xin Jiang · Chunjing XU · Hang Xu
- 2021: Panel Discussion
  Pascal Poupart · Ali Ghodsi · Luke Zettlemoyer · Sameer Singh · Kevin Duh · Yejin Choi · Lu Hou
- 2021: Compression and Acceleration of Pre-trained Language Models
  Lu Hou
- 2020 Poster: DynaBERT: Dynamic BERT with Adaptive Width and Depth
  Lu Hou · Zhiqi Huang · Lifeng Shang · Xin Jiang · Xiao Chen · Qun Liu
- 2020 Spotlight: DynaBERT: Dynamic BERT with Adaptive Width and Depth
  Lu Hou · Zhiqi Huang · Lifeng Shang · Xin Jiang · Xiao Chen · Qun Liu
- 2019 Poster: Normalization Helps Training of Quantized LSTM
  Lu Hou · Jinhua Zhu · James Kwok · Fei Gao · Tao Qin · Tie-Yan Liu