Timezone: »

 
Improving Zero-shot Generalization and Robustness of Multi-modal Models
Yunhao Ge · Jie Ren · Ming-Hsuan Yang · Yuxiao Wang · Andrew Gallagher · Hartwig Adam · Laurent Itti · Balaji Lakshminarayanan · Jiaping Zhao

Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reason for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First,we develop a simple and efficient zero-shot post-hoc method to identify images where the top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically we use information from parents in the hierarchy to add superclass to prompts, and use information from children in the hierarchy to devise fine-grained prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method consistently improvement on other ImageNet shifted datasets and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures. Code for our experiments is opensourced at link hidden for anonymity.

Author Information

Yunhao Ge (University of Southern California)

I am now a 2nd-year PhD student in iLab at University of Southern University(USC), working with Prof. Laurent Itti. My research interests lie in Computer vision, Robotics and General AI. Currently, I am focusing on simulating baby learning (Reasoning, Attention, Imagination) by using various learning algorithms (Representation Learning, Adversarial Learning, Meta Learning, GNN, Reinforcement Learning, etc.).

Jie Ren (Google Inc.)
Ming-Hsuan Yang (Google)
Yuxiao Wang (Google Research)
Andrew Gallagher (Google)
Hartwig Adam (Google)
Laurent Itti (University of Southern California (USC))
Balaji Lakshminarayanan (Google Brain)

Balaji Lakshminarayanan is a research scientist at Google Brain. Prior to that, he was a research scientist at DeepMind. He received his PhD from the Gatsby Unit, University College London where he worked with Yee Whye Teh. His recent research has focused on probabilistic deep learning, specifically, uncertainty estimation, out-of-distribution robustness and deep generative models. Notable contributions relevant to the tutorial include developing state-of-the-art methods for calibration under dataset shift (such as deep ensembles and AugMix) and showing that deep generative models do not always know what they don't know. He has co-organized several workshops on "Uncertainty and Robustness in deep learning" and served as Area Chair for NeurIPS, ICML, ICLR and AISTATS.

Jiaping Zhao (Google Inc.)

More from the Same Authors