Expo Demonstration
La Nouvelle Orleans Ballroom A-C (level 2)

Large language models (LLMs) have become universal and versatile tools, and demand is growing to run them directly on user devices such as smartphones. However, deploying such models on edge devices is challenging because inference is memory-bound, a consequence of their huge parameter counts and the autoregressive nature of inference. We have developed two approaches to address these challenges, enabling fast and accurate inference on such edge devices. First, to reduce the computational time and memory footprint of the LLaMA2-Chat 7B target model so that it fits on the Snapdragon Mobile Platform, we use the AI Model Efficiency Toolkit (AIMET) for 4-bit weight quantization and 16-bit integer activation quantization. In addition, to retain the performance characteristics of the best floating-point “chat” models, we add a knowledge distillation component to our Quantization-aware Training/Tuning (QAT), which encourages the final quantized model to produce outputs comparable to those of the best floating-point models with minimal reduction in text generation quality and benchmark accuracy. This is important because the best chat performance of available models relies on fine-tuning methods and datasets that are not publicly available. Second, to further mitigate the inference speed bottleneck caused by memory-bound processing, we equipped LLaMA2-Chat 7B with speculative decoding. Since speculative decoding requires a much smaller draft model and the smallest variant in the LLaMA2 model family has 7B parameters, we trained LLaMA2-Chat-Drafter-115M, which is only about 2% of the size of the target model, using knowledge distillation from the target model. On the Snapdragon® 8 Gen 3 Mobile Platform, with 8-bit weight quantization of our draft model, we demonstrate a 2x inference speed-up without sacrificing text generation quality or benchmark accuracy. Overall, the two methods together, QAT and speculative decoding, lead to efficient on-device performance with minimal reduction in accuracy.
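To illustrate why a small draft model can speed up a memory-bound target model, the following is a minimal sketch of speculative decoding in its simplest greedy form. It is not the demo's implementation: the function names, the toy integer "models", and the draft length `k` are all hypothetical, and it uses greedy token matching rather than the probabilistic acceptance rule used with sampling. The draft proposes several tokens cheaply, the target verifies them, and the output is identical to what the target alone would have produced.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; k (draft length) is an illustrative choice."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) The small draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2) The target verifies the proposal. A real system scores all
        #    positions in one batched forward pass; this sketch loops instead.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if expected == tok:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched, so the target's next token comes for free.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens


# Toy stand-ins (assumptions for illustration only, not real language models).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    guess = toy_target(ctx)
    # The draft disagrees with the target on some contexts.
    return (guess + 1) % 17 if ctx[-1] % 5 == 0 else guess


if __name__ == "__main__":
    baseline = greedy_decode(toy_target, [1], 20)
    speculative = speculative_decode(toy_target, toy_draft, [1], 20)
    print(baseline == speculative)
```

In this sketch, the speed-up comes from step 2: when the target can check several draft tokens in a single pass over its weights, each accepted token amortizes the memory traffic that dominates inference cost on edge devices.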
