

Poster

Learning Where to Edit Vision Transformers

Yunqiao Yang · Long-Kai Huang · Shengzhuang Chen · Kede Ma · Ying Wei


Abstract:

The recent revelation of errors in large pre-trained models calls for urgent post-hoc editing. Model editing seeks to efficiently correct prediction errors while ensuring generalization to neighboring failure examples that share the same root cause, as well as locality, so that irrelevant examples are left unaffected. Despite notable strides in large language models, best practices for editing vision Transformers (ViTs) in computer vision remain largely unexplored. Our study addresses this gap by rectifying the predictive errors of ViTs, specifically those stemming from subpopulation shifts. We introduce a learning-to-learn approach that identifies a small set of critical parameters to edit in response to an erroneous sample, where the locations of these parameters are output by a hypernetwork. By mimicking the edit process and explicitly optimizing for edit success, the hypernetwork is trained to output reliable and generalizable editing locations. In addition, a sparsity constraint on the hypernetwork keeps the editing localized, preventing distortion of irrelevant parameters. To validate our approach, we curate two benchmarks on which existing pre-trained ViTs struggle to predict correctly; our approach shows not only superior performance but also the flexibility to provide customized solutions for a variety of application-specific requirements.
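The following is a minimal PyTorch sketch of the kind of pipeline the abstract describes, not the authors' implementation: a hypernetwork scores the weights of a chosen layer given an erroneous sample, a top-k threshold turns the scores into a sparse editing mask, and only the masked entries are updated to correct the prediction. All names (ToyViTBlock, MaskHypernet, edit_once), the sparsity level, and the update rule are illustrative assumptions; the meta-training of the hypernetwork toward edit success is omitted.

# Hedged sketch of hypernetwork-guided sparse editing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyViTBlock(nn.Module):
    """Stand-in for a single ViT sub-layer whose weights may be edited."""
    def __init__(self, dim=64, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(dim, dim)          # candidate layer for editing
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.head(F.gelu(self.fc(x)))


class MaskHypernet(nn.Module):
    """Maps a sample's feature vector to per-weight scores for fc.weight."""
    def __init__(self, dim=64):
        super().__init__()
        self.dim = dim
        self.scorer = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim * dim)
        )

    def forward(self, feats, sparsity=0.01):
        scores = self.scorer(feats).mean(0)            # (dim*dim,) averaged over batch
        k = max(1, int(sparsity * scores.numel()))     # keep only the top-k locations
        mask = torch.zeros_like(scores)
        mask[scores.topk(k).indices] = 1.0
        return mask.view(self.dim, self.dim)


def edit_once(model, mask, x, y, lr=1e-2, steps=10):
    """Correct one erroneous sample by updating only the masked weights."""
    for _ in range(steps):
        loss = F.cross_entropy(model(x), y)
        grad = torch.autograd.grad(loss, model.fc.weight)[0]
        with torch.no_grad():
            model.fc.weight -= lr * mask * grad        # localized update
    return model


if __name__ == "__main__":
    model, hypernet = ToyViTBlock(), MaskHypernet()
    x = torch.randn(1, 64)                  # features of the erroneous sample
    y = torch.tensor([3])                   # its correct label
    mask = hypernet(x)                      # where to edit
    edit_once(model, mask.detach(), x, y)   # how to edit (plain fine-tuning here)

In the paper's setting, the hypernetwork would be meta-trained so that masks produced for one erroneous sample also generalize to neighboring failures while leaving unrelated predictions intact; the sketch above only shows the inference-time editing step under that assumption.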
