Poster
Compressing Large Language Models using Low Rank and Low Precision Decomposition
Rajarshi Saha · Naomi Sagan · Varun Srivastava · Andrea Goldsmith · Mert Pilanci
East Exhibit Hall A-C #1910
Abstract:
The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition as W ≈ Q + LR^T. Here, L and R are low-rank factors, and the entries of Q, L, and R are quantized. The model is compressed by substituting each layer with its Q + LR^T decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem min_{Q,L,R} ||(Q + LR^T − W) X^T||_F^2, where X is the calibration data, and Q, L, and R are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that LlaMa-2 7B/13B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter.
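To make the decomposition concrete, here is a minimal, hypothetical NumPy sketch of the W ≈ Q + LR^T idea: it alternates between quantizing the residual and refitting low-rank factors, then reports the calibration-weighted error ||(Q + LR^T − W) X^T||_F. The function names, uniform quantizer, and plain alternating scheme are assumptions and simplifications for illustration only; CALDERA itself minimizes the calibration-weighted objective directly via a rank-constrained regression framework, so this is not the authors' implementation.

```python
# Illustrative sketch (NOT the authors' CALDERA algorithm):
# approximate W ~= Q + L @ R.T by alternating between
#   (1) quantizing the residual W - L R^T to obtain Q, and
#   (2) refitting low-rank factors L, R to the residual W - Q,
# then measuring the calibration-weighted error ||(Q + L R^T - W) X^T||_F.

import numpy as np

def uniform_quantize(A, bits):
    """Simple uniform quantizer with 2**bits levels (an assumed stand-in)."""
    levels = 2 ** bits
    scale = (A.max() - A.min()) / (levels - 1) + 1e-12
    return np.round((A - A.min()) / scale) * scale + A.min()

def low_rank_low_precision(W, rank=64, q_bits=2, lr_bits=4, iters=10):
    """Alternating decomposition W ~= Q + L @ R.T with quantized Q, L, R."""
    n, d = W.shape
    L = np.zeros((n, rank))
    R = np.zeros((d, rank))
    for _ in range(iters):
        # Quantize what remains after removing the current low-rank part.
        Q = uniform_quantize(W - L @ R.T, q_bits)
        # Refit the low-rank factors to the new residual via truncated SVD.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = uniform_quantize(U[:, :rank] * s[:rank], lr_bits)
        R = uniform_quantize(Vt[:rank].T, lr_bits)
    return Q, L, R

# Toy usage: compare the calibration-weighted approximation error.
rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))      # a weight matrix
X = rng.standard_normal((128, 512))      # calibration data (rows are samples)
Q, L, R = low_rank_low_precision(W, rank=32, q_bits=2, lr_bits=4)
err = np.linalg.norm((Q + L @ R.T - W) @ X.T) / np.linalg.norm(W @ X.T)
print(f"relative calibration-weighted error: {err:.3f}")
```

In this sketch the calibration data X only enters the error metric, whereas the objective stated in the abstract incorporates X into the decomposition itself, which is one reason the method outperforms calibration-agnostic quantization at very low bit budgets.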