Poster
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
Huatian Zhang · Lei Zhang · Yongdong Zhang · Zhendong Mao
East Exhibit Hall A-C #3501
Efficient transfer learning has shown remarkable performance in tuning large-scale vision-language models (VLMs) toward downstream tasks with limited data resources. The key challenge of efficient transfer lies in adjusting image-text alignment to be task-specific while preserving pre-trained general knowledge. However, existing methods adjust image-text alignment merely on a set of observed samples, e.g., a dataset or an external knowledge base, and thus cannot guarantee that the correspondence of general concepts between the image and text latent manifolds is preserved intact, leading to weak generalization of the adjusted alignment. In this work, we propose a Homology Consistency (HC) constraint for efficient transfer on VLMs, which explicitly constrains the correspondence of the image and text latent manifolds through structural equivalence based on persistent homology during downstream tuning. Specifically, we build simplicial complexes on top of the data to mimic the topology of the latent manifolds. We then track the persistence of the homology classes of topological features across multiple scales and guide the directions of the persistence tracks in the image and text manifolds to coincide with each other. Additionally, we apply a deviating perturbation to generalize the persistence coincidence to unseen data. Further, we tailor the implementation of the proposed HC constraint to the two main paradigms of adapter tuning. Extensive experiments on few-shot learning across 11 datasets and on domain generalization demonstrate the effectiveness and robustness of our method. All code will be released.
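As a rough illustration of the kind of topological comparison the abstract describes, the sketch below builds Vietoris-Rips persistence diagrams on batches of image and text embeddings and scores their disagreement with a bottleneck distance, one term per homology dimension. The `ripser`/`persim` packages, the Rips construction, and the bottleneck penalty are all illustrative assumptions, not the paper's exact HC constraint, which additionally guides the directions of persistence tracks across scales and applies a deviating perturbation.

```python
# Minimal sketch (assumed design, not the paper's implementation): compare
# the persistent homology of image vs. text embedding batches.
import numpy as np
from ripser import ripser          # Vietoris-Rips persistence
from persim import bottleneck      # bottleneck distance between diagrams


def persistence_diagrams(features: np.ndarray, maxdim: int = 1):
    """Vietoris-Rips persistence diagrams (H0, ..., H_maxdim) of a point
    cloud of embeddings, one row per sample."""
    return ripser(features, maxdim=maxdim)["dgms"]


def homology_consistency_penalty(img_feats: np.ndarray,
                                 txt_feats: np.ndarray) -> float:
    """Sum of per-dimension bottleneck distances between the image and
    text persistence diagrams; smaller means more similar topology."""
    img_dgms = persistence_diagrams(img_feats)
    txt_dgms = persistence_diagrams(txt_feats)
    penalty = 0.0
    for d_img, d_txt in zip(img_dgms, txt_dgms):
        # Drop infinite-death points (the essential H0 class) so the
        # bottleneck distance is well-defined.
        d_img = d_img[np.isfinite(d_img).all(axis=1)]
        d_txt = d_txt[np.isfinite(d_txt).all(axis=1)]
        if len(d_img) == 0 or len(d_txt) == 0:
            continue  # nothing to match in this dimension
        penalty += bottleneck(d_img, d_txt)
    return penalty


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(64, 16))                # stand-in image embeddings
    txt = img + 0.05 * rng.normal(size=(64, 16))   # nearby text embeddings
    print(homology_consistency_penalty(img, txt))  # small for aligned clouds
```

A diagram-level distance like this is not differentiable out of the box; in an actual tuning loop one would need a differentiable surrogate (e.g., a loss on matched persistence pairs) so the constraint can backpropagate into the adapter parameters.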