

Poster

Sparse High Rank Adapters

Kartikeya Bhardwaj · Nilesh Pandey · Sweta Priyadarshi · Rafael Esteves · Viswanath Ganapathy · Shreya Kadambi · Shubhankar Borse · Paul Whatmough · Risheek Garrepalli · Mart van Baalen · Markus Nagel · Harris Teague

Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Low Rank Adaptation (LoRA) has gained massive attention in recent generative AI research. One of the main advantages of LoRA is that it can be fused into the pretrained model weights, adding no overhead during inference. However, from a mobile deployment standpoint, we can either avoid inference overhead in the fused mode but lose the ability to switch adapters rapidly, or suffer significant (up to 30% higher) inference latency while enabling rapid adapter switching in the unfused mode. LoRA also exhibits concept loss when multiple adapters are used concurrently. We propose Sparse High Rank Adapter (SHiRA), a new adapter that incurs no inference overhead, enables rapid adapter switching directly in the fused mode, and significantly reduces concept loss. Specifically, SHiRA can be trained by directly tuning 1-2% of the base model weights while leaving the others unchanged. This results in a highly sparse adapter that can be switched directly in the fused mode. We further provide theoretical and empirical insights into how high sparsity in SHiRA aids multi-adapter fusion by reducing concept loss. Our extensive experiments on LVMs and LLMs demonstrate that finetuning merely 1-2% of the parameters in the base model is sufficient for many adapter tasks while enabling both rapid adapter switching and multi-adapter fusion.
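The core mechanism described in the abstract (tuning only a small, fixed subset of base weights so that the adapter is a sparse delta that can be added to or subtracted from the fused weights) can be illustrated with a minimal sketch. The code below is an assumption-laden illustration in PyTorch, not the authors' implementation: the `attach_shira_masks` and `extract_adapter` helpers are hypothetical names, and the random mask selection is purely for demonstration (the paper's actual mask-selection strategy may differ).

```python
# Hedged sketch: train only ~1-2% of base weights by masking gradients,
# then treat the resulting sparse weight delta as the adapter.
import copy
import torch
import torch.nn as nn


def attach_shira_masks(model: nn.Module, sparsity: float = 0.01):
    """Register gradient hooks so only ~`sparsity` of each weight matrix is updated.

    Mask selection here is random for illustration only.
    """
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:  # skip biases/norms in this sketch
            continue
        mask = (torch.rand_like(param) < sparsity).float()
        masks[name] = mask
        # Zero gradients everywhere except the sparse trainable entries.
        param.register_hook(lambda grad, m=mask: grad * m)
    return masks


def extract_adapter(model: nn.Module, base_state: dict, masks: dict):
    """The adapter is the sparse difference between tuned and base weights."""
    delta = {}
    for name, param in model.named_parameters():
        if name in masks:
            delta[name] = (param.detach() - base_state[name]) * masks[name]
    return delta


# Usage sketch:
#   base_state = copy.deepcopy(model.state_dict())
#   masks = attach_shira_masks(model, sparsity=0.01)
#   ... finetune as usual ...
#   adapter = extract_adapter(model, base_state, masks)
# Switching adapters in the fused mode then amounts to subtracting the old
# sparse delta and adding the new one, touching only 1-2% of the weights.
```

Because the adapter never introduces extra matrices into the forward pass, inference cost matches the base model; switching only rewrites the sparse set of changed weights rather than re-fusing full low-rank products.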
