Poster
SIRIUS: Contextual Sparsity with Correction for Efficient LLMs
Yang Zhou · Zhuoming Chen · Zhaozhuo Xu · Victoria Lin · Beidi Chen
West Ballroom A-D #6906
As large language models (LLMs) grow, inference efficiency becomes increasingly important. Various approximation methods have been proposed to reduce inference-time cost, among them contextual sparsity. Through a thorough evaluation of contextual sparsity methods, we found that while they help reduce hallucination, they significantly degrade model performance on reasoning and deduction tasks. However, despite the gap in end-to-end accuracy, we observed that sparse models and original models often share the same problem-solving logic, and correcting only a few tokens can recover the performance. This paper introduces Sirius, an efficient correction mechanism that enables accurate LLM inference with contextual sparsity. At the cost of an additional 11%-18% average parameters used per token, Sirius improves GSM8K accuracy from 0.5868 to 0.7263 for fine-grained sparsity methods and from 0.3859 to 0.6679 for coarse-grained sparsity methods.
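For readers unfamiliar with the underlying idea, the following is a minimal, hypothetical sketch of contextual sparsity in a single MLP layer: per token, only a small, input-dependent subset of neurons is evaluated. This is an illustration, not the authors' implementation; all sizes and names here are invented, and a real system would *predict* the active set before computing activations (here we compute them fully, so there is no actual speedup).

```python
import numpy as np

# Hypothetical layer sizes for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff, k = 16, 64, 8  # keep k of d_ff neurons per token

W_in = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))

def dense_mlp(x):
    # Full MLP: all d_ff neurons contribute.
    h = np.maximum(x @ W_in, 0.0)  # ReLU activations, shape (d_ff,)
    return h @ W_out

def sparse_mlp(x, k=k):
    # Contextual sparsity sketch: keep only the top-k most-activated
    # neurons for THIS input, and compute the output from those alone.
    # (A real method predicts this set up front to skip the dense pass.)
    h = np.maximum(x @ W_in, 0.0)
    idx = np.argsort(h)[-k:]           # input-dependent neuron selection
    return h[idx] @ W_out[idx]         # only k rows of W_out are touched

x = rng.normal(size=d_model)
y_dense, y_sparse = dense_mlp(x), sparse_mlp(x)
err = np.linalg.norm(y_dense - y_sparse) / np.linalg.norm(y_dense)
print(f"relative error with {k}/{d_ff} neurons kept: {err:.3f}")
```

The gap between `y_sparse` and `y_dense` is exactly the kind of approximation error that, per the abstract, Sirius corrects by fixing a small number of tokens rather than rerunning the dense model.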