

Poster

Compressing Large Language Models using Low Rank and Low Precision Decomposition

Rajarshi Saha · Naomi Sagan · Varun Srivastava · Andrea Goldsmith · Mert Pilanci

East Exhibit Hall A-C #1910
Paper · Slides · Poster · OpenReview
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract: The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $W$ by approximating it via a low-rank, low-precision decomposition $W \approx Q + LR$. Here, $L$ and $R$ are low-rank factors, and the entries of $Q$, $L$ and $R$ are quantized. The model is compressed by substituting each layer with its $Q + LR$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $L$ and $R$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem $\min_{Q,L,R} \lVert (Q + LR - W)X \rVert_F^2$, where $X$ is the calibration data and $Q$, $L$, $R$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/13B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter.
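
To make the decomposition concrete, the sketch below alternates between quantizing a backbone Q and fitting a quantized rank-k correction LR to the residual of a single weight matrix. This is only a minimal illustration under simplifying assumptions, not the authors' implementation: it uses plain round-to-nearest uniform quantization, drops the calibration-data weighting X (so it minimizes ||Q + LR - W||_F rather than the calibration-weighted objective above), and the helper name uniform_quantize is hypothetical.

    # Illustrative Q + LR sketch (simplified; not the CALDERA reference code).
    import numpy as np

    def uniform_quantize(M, bits):
        """Round-to-nearest uniform quantizer over the range of M (hypothetical helper)."""
        levels = 2 ** bits - 1
        lo, hi = M.min(), M.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        return np.round((M - lo) / scale) * scale + lo

    def q_plus_lr(W, rank=64, q_bits=2, lr_bits=4, iters=10):
        """Approximate W with a low-precision backbone Q plus a quantized low-rank term LR."""
        Q = uniform_quantize(W, q_bits)  # coarse low-precision backbone
        for _ in range(iters):
            # Fit a rank-`rank` correction to the residual via truncated SVD.
            U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
            L = uniform_quantize(U[:, :rank] * S[:rank], lr_bits)
            R = uniform_quantize(Vt[:rank, :], lr_bits)
            # Re-quantize the backbone against the remaining residual.
            Q = uniform_quantize(W - L @ R, q_bits)
        return Q, L, R

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((512, 512))
        Q, L, R = q_plus_lr(W)
        err = np.linalg.norm(Q + L @ R - W) / np.linalg.norm(W)
        print(f"relative Frobenius error: {err:.3f}")

In the paper's setting the residual fit would additionally be weighted by the calibration data X, which turns the low-rank step into a rank-constrained regression rather than a plain truncated SVD.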
