

Poster

Compressing Large Language Models using Low Rank and Low Precision Decomposition

Rajarshi Saha · Naomi Sagan · Varun Srivastava · Andrea Goldsmith · Mert Pilanci

East Exhibit Hall A-C #1910
Paper · Slides · Poster · OpenReview
Thu 12 Dec 11 a.m. PST — 2 p.m. PST

Abstract: The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $W$ by approximating it via a low-rank, low-precision decomposition $W \approx Q + LR$. Here, $L$ and $R$ are low-rank factors, and the entries of $Q$, $L$ and $R$ are quantized. The model is compressed by substituting each layer with its $Q + LR$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $L$ and $R$ are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem $\min_{Q,L,R} \lVert (Q + LR - W)X \rVert_F^2$, where $X$ is the calibration data and $Q$, $L$, $R$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/13B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter.
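
To make the decomposition concrete, the sketch below alternates between quantizing a backbone Q and fitting a quantized rank-k correction LR to the residual of a single weight matrix. This is only a minimal illustration under simplifying assumptions, not the authors' implementation: it uses plain round-to-nearest uniform quantization, drops the calibration-data weighting X (so it minimizes ||Q + LR - W||_F rather than the calibration-weighted objective above), and the helper name uniform_quantize is hypothetical.

    # Illustrative Q + LR sketch (simplified; not the CALDERA reference code).
    import numpy as np

    def uniform_quantize(M, bits):
        """Round-to-nearest uniform quantizer over the range of M (hypothetical helper)."""
        levels = 2 ** bits - 1
        lo, hi = M.min(), M.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        return np.round((M - lo) / scale) * scale + lo

    def q_plus_lr(W, rank=64, q_bits=2, lr_bits=4, iters=10):
        """Approximate W with a low-precision backbone Q plus a quantized low-rank term LR."""
        Q = uniform_quantize(W, q_bits)  # coarse low-precision backbone
        for _ in range(iters):
            # Fit a rank-`rank` correction to the residual via truncated SVD.
            U, S, Vt = np.linalg.svd(W - Q, full_matrices=False)
            L = uniform_quantize(U[:, :rank] * S[:rank], lr_bits)
            R = uniform_quantize(Vt[:rank, :], lr_bits)
            # Re-quantize the backbone against the remaining residual.
            Q = uniform_quantize(W - L @ R, q_bits)
        return Q, L, R

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        W = rng.standard_normal((512, 512))
        Q, L, R = q_plus_lr(W)
        err = np.linalg.norm(Q + L @ R - W) / np.linalg.norm(W)
        print(f"relative Frobenius error: {err:.3f}")

In the paper's setting the residual fit would additionally be weighted by the calibration data X, which turns the low-rank step into a rank-constrained regression rather than a plain truncated SVD.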
