Expo Demonstration
Disaggregated LLM Serving on AI Accelerators
Parmeet Kohli · Suman Gunnala · Ankit Arora
Upper Level Room 29A-D
This demo showcases disaggregated serving on the Qualcomm Cloud AI 100 Ultra card, a power-efficient AI inference accelerator purpose-built for large language model (LLM) serving. The accelerator has been deployed across multiple cloud service providers (CSPs) globally and is actively serving state-of-the-art LLMs and other generative AI workloads.

LLM inference typically involves two distinct stages: prefill and decode. The prefill stage is compute-bound, while the decode stage is memory-bound. Applying uniform parallelism strategies across both stages often results in suboptimal performance, particularly in key metrics such as Time to First Token (TTFT) and Requests Per Minute (RPM) at the cluster level.

This demo highlights the performance benefits of disaggregated parallelism strategies tailored to the unique characteristics of each stage. By optimizing the execution of prefill and decode independently, we demonstrate significant improvements in TTFT and overall throughput.

Key benefits:

- Improved TTFT: faster initial response times for LLM queries.

- Higher throughput: more requests served per minute at the cluster level.

- Optimized resource utilization: efficient mapping of compute and memory resources to match workload characteristics.

- SLA-adherent performance: maintains service quality and responsiveness within strict latency and throughput requirements.
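The split described above can be sketched in a few lines of Python. This is a minimal, illustrative model only: the class names (`PrefillWorker`, `DecodeWorker`, `Request`) and the token arithmetic are hypothetical stand-ins, not the Qualcomm Cloud AI 100 API. The point it shows is the structural one: prefill runs once over the whole prompt (compute-bound, determines TTFT), hands off a KV cache, and decode then generates tokens one step at a time (memory-bound), so each pool can be sized and parallelized independently.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch of disaggregated LLM serving.
# All names and the toy "model" arithmetic are hypothetical; only the
# two-stage structure (prefill pool -> decode pool) mirrors the demo.

@dataclass
class Request:
    rid: int
    prompt_tokens: List[int]
    kv_cache: Dict[int, int] = field(default_factory=dict)   # filled by prefill
    output_tokens: List[int] = field(default_factory=list)

class PrefillWorker:
    """Compute-bound stage: processes the entire prompt in one pass,
    producing the KV cache and the first output token (drives TTFT)."""
    def run(self, req: Request) -> Request:
        req.kv_cache = {i: t for i, t in enumerate(req.prompt_tokens)}
        req.output_tokens.append(sum(req.prompt_tokens) % 100)  # toy "first token"
        return req

class DecodeWorker:
    """Memory-bound stage: emits one token per step, reading the
    KV cache produced by prefill on every iteration."""
    def run(self, req: Request, max_new_tokens: int) -> Request:
        while len(req.output_tokens) < max_new_tokens:
            last = req.output_tokens[-1]
            req.output_tokens.append((last + len(req.kv_cache)) % 100)
        return req

def serve(prompts: List[List[int]], max_new_tokens: int = 4) -> List[Request]:
    # The pools are sized independently: prefill for compute throughput,
    # decode for memory bandwidth. Here each pool has one worker.
    prefill_pool = [PrefillWorker()]
    decode_pool = [DecodeWorker()]
    done = []
    for rid, prompt in enumerate(prompts):
        req = Request(rid, prompt)
        req = prefill_pool[rid % len(prefill_pool)].run(req)               # stage 1
        req = decode_pool[rid % len(decode_pool)].run(req, max_new_tokens)  # stage 2
        done.append(req)
    return done
```

In a real deployment the handoff between the two pools is a KV-cache transfer rather than an in-process object, and a scheduler batches prefill and decode separately; the sketch elides both.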