Large-scale vision-language pre-training has achieved promising results on downstream tasks. Existing methods rely heavily on the assumption that the image-text pairs crawled from the Internet are in perfect one-to-one correspondence. In real scenarios, however, this assumption rarely holds: the text description, obtained by crawling the metadata attached to the image, often suffers from semantic mismatch and mutual compatibility. To address these issues, we introduce PyramidCLIP, which constructs an input pyramid with different semantic levels for each modality, and aligns visual and linguistic elements hierarchically via peer-level semantics alignment and cross-level relation alignment. Furthermore, we soften the loss on negative samples (unpaired samples) to weaken the strict constraint during the pre-training stage, thus mitigating the risk of forcing the model to distinguish compatible negative pairs. Experiments on five downstream tasks demonstrate the effectiveness of the proposed PyramidCLIP. In particular, with the same 15 million pre-training image-text pairs, PyramidCLIP exceeds CLIP on ImageNet zero-shot classification top-1 accuracy by 10.6%/13.2%/10.0% with ResNet50/ViT-B32/ViT-B16 based image encoders, respectively. When scaled to larger datasets, PyramidCLIP achieves state-of-the-art results on several downstream tasks. Notably, the results of PyramidCLIP-ResNet50 trained on 143M image-text pairs surpass those of CLIP trained on 400M pairs on the ImageNet zero-shot classification task, significantly improving the data efficiency of CLIP.
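To make the idea of softening the loss on negative samples concrete, here is a minimal sketch in plain Python. The abstract does not say how the softening is done, so this example assumes label smoothing on the contrastive targets: each matched pair gets target weight 1 - alpha and the remaining alpha is spread evenly over the unpaired samples. The function names and the `alpha` parameter are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def softened_contrastive_loss(sim, alpha=0.1):
    """Image-to-text contrastive loss with softened targets.

    sim[i][j] is the similarity logit between image i and text j;
    matched pairs sit on the diagonal. alpha is an assumed smoothing
    strength: alpha=0 recovers the usual hard one-hot target.
    """
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two image-text pairs")
    total = 0.0
    for i, row in enumerate(sim):
        logp = log_softmax(row)
        for j in range(n):
            # Matched pair keeps 1 - alpha; negatives share alpha equally.
            target = 1.0 - alpha if i == j else alpha / (n - 1)
            total -= target * logp[j]
    return total / n

def symmetric_softened_loss(sim, alpha=0.1):
    """Average of the image-to-text and text-to-image directions."""
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (softened_contrastive_loss(sim, alpha)
                  + softened_contrastive_loss(transposed, alpha))

# Toy batch of two pairs whose matched similarities are already high.
sim = [[10.0, 0.0], [0.0, 10.0]]
hard = softened_contrastive_loss(sim, alpha=0.0)
soft = softened_contrastive_loss(sim, alpha=0.1)
```

With hard targets the loss keeps falling as negatives are pushed further away, while the softened target penalizes driving a negative's probability all the way to zero. This is one way to avoid forcing apart unpaired samples that are in fact compatible.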
Author Information
Yuting Gao (Tencent Youtu Lab)
Jinfeng Liu (Shanghai Jiaotong University)
Zihan Xu (Tencent)
Jun Zhang (Tencent Youtu Lab)
Ke Li (Tencent)
Rongrong Ji (Xiamen University, China)
Chunhua Shen (University of Adelaide)
More from the Same Authors
-
2022 Poster: Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval »
Chengzhi Lin · Ancong Wu · Junwei Liang · Jun Zhang · Wenhang Ge · Wei-Shi Zheng · Chunhua Shen -
2022 Poster: Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach »
Peng Mi · Li Shen · Tianhe Ren · Yiyi Zhou · Xiaoshuai Sun · Rongrong Ji · Dacheng Tao -
2022 Poster: Multi-dataset Training of Transformers for Robust Action Recognition »
Junwei Liang · Enwei Zhang · Jun Zhang · Chunhua Shen -
2022 Poster: SegViT: Semantic Segmentation with Plain Vision Transformers »
Bowen Zhang · Zhi Tian · Quan Tang · Xiangxiang Chu · Xiaolin Wei · Chunhua Shen · Yifan liu -
2022 Poster: Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images »
Zhi Tian · Xiangxiang Chu · Xiaoming Wang · Xiaolin Wei · Chunhua Shen -
2022 Spotlight: Lightning Talks 6A-4 »
Xiu-Shen Wei · Konstantina Dritsa · Guillaume Huguet · ABHRA CHAUDHURI · Zhenbin Wang · Kevin Qinghong Lin · Yutong Chen · Jianan Zhou · Yongsen Mao · Junwei Liang · Jinpeng Wang · Mao Ye · Yiming Zhang · Aikaterini Thoma · H.-Y. Xu · Daniel Sumner Magruder · Enwei Zhang · Jianing Zhu · Ronglai Zuo · Massimiliano Mancini · Hanxiao Jiang · Jun Zhang · Fangyun Wei · Faen Zhang · Ioannis Pavlopoulos · Zeynep Akata · Xiatian Zhu · Jingfeng ZHANG · Alexander Tong · Mattia Soldan · Chunhua Shen · Yuxin Peng · Liuhan Peng · Michael Wray · Tongliang Liu · Anjan Dutta · Yu Wu · Oluwadamilola Fasina · Panos Louridas · Angel Chang · Manik Kuchroo · Manolis Savva · Shujie LIU · Wei Zhou · Rui Yan · Gang Niu · Liang Tian · Bo Han · Eric Z. XU · Guy Wolf · Yingying Zhu · Brian Mak · Difei Gao · Masashi Sugiyama · Smita Krishnaswamy · Rong-Cheng Tu · Wenzhe Zhao · Weijie Kong · Chengfei Cai · WANG HongFa · Dima Damen · Bernard Ghanem · Wei Liu · Mike Zheng Shou -
2022 Spotlight: Multi-dataset Training of Transformers for Robust Action Recognition »
Junwei Liang · Enwei Zhang · Jun Zhang · Chunhua Shen -
2022 Spotlight: Lightning Talks 3A-4 »
Jinzhi Zhang · Hao Jiang · Hongrui Cai · Qi Yi · Yang Jin · Zhi Tian · Rui Zhang · Wanquan Feng · Xiangxiang Chu · Ruofan Tang · yongzhi li · Yadong Mu · Zehuan Yuan · shaohui peng · Zheng Cao · Xiaoming Wang · Xuetao Feng · Xiaolin Wei · Jiaming Guo · Yadong Mu · Yan Wang · Jing Xiao · Xing Hu · Chunhua Shen · Ruqi Huang · Juyong Zhang · Zidong Du · LU FANG · xishan zhang · Qi Guo · Yunji Chen -
2022 Panel: Panel 3C-3: PyramidCLIP: Hierarchical Feature… & Flamingo: a Visual… »
Jeff Donahue · Yuting Gao -
2022 Spotlight: Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images »
Zhi Tian · Xiangxiang Chu · Xiaoming Wang · Xiaolin Wei · Chunhua Shen -
2022 Poster: Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face Recognition »
Shuai Jia · Bangjie Yin · Taiping Yao · Shouhong Ding · Chunhua Shen · Xiaokang Yang · Chao Ma -
2022 Poster: DENSE: Data-Free One-Shot Federated Learning »
Jie Zhang · Chen Chen · Bo Li · Lingjuan Lyu · Shuang Wu · Shouhong Ding · Chunhua Shen · Chao Wu -
2022 Poster: Learning Best Combination for Efficient N:M Sparsity »
Yuxin Zhang · Mingbao Lin · ZhiHang Lin · Yiting Luo · Ke Li · Fei Chao · Yongjian Wu · Rongrong Ji -
2022 Poster: Hierarchical Normalization for Robust Monocular Depth Estimation »
Chi Zhang · Wei Yin · Billzb Wang · Gang Yu · BIN FU · Chunhua Shen -
2020 Poster: Rotated Binary Neural Network »
Mingbao Lin · Rongrong Ji · Zihan Xu · Baochang Zhang · Yan Wang · Yongjian Wu · Feiyue Huang · Chia-Wen Lin -
2020 Poster: UWSOD: Toward Fully-Supervised-Level Capacity Weakly Supervised Object Detection »
Yunhang Shen · Rongrong Ji · Zhiwei Chen · Yongjian Wu · Feiyue Huang -
2019 Poster: Variational Structured Semantic Inference for Diverse Image Captioning »
Fuhai Chen · Rongrong Ji · Jiayi Ji · Xiaoshuai Sun · Baochang Zhang · Xuri Ge · Yongjian Wu · Feiyue Huang · Yan Wang -
2019 Poster: FreeAnchor: Learning to Match Anchors for Visual Object Detection »
Xiaosong Zhang · Fang Wan · Chang Liu · Rongrong Ji · Qixiang Ye -
2019 Poster: Information Competing Process for Learning Diversified Representations »
Jie Hu · Rongrong Ji · ShengChuan Zhang · Xiaoshuai Sun · Qixiang Ye · Chia-Wen Lin · Qi Tian -
2014 Poster: Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors »
Lingqiao Liu · Chunhua Shen · Lei Wang · Anton van den Hengel · Chao Wang -
2009 Poster: Positive Semidefinite Metric Learning with Boosting »
Chunhua Shen · Junae Kim · Lei Wang · Anton van den Hengel -
2008 Poster: PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning »
Chunhua Shen · Alan Welsh · Lei Wang