Talk

ATLAS: AdapTive-LeArning Speculator System for Real-Time LLM Inference Acceleration with Together AI

Junxiong Wang ⋅ Ben Athiwaratkun

2025 Talk

Abstract

We present ATLAS (AdapTive-LeArning Speculator System), a speculative decoding framework that achieves up to 4x faster LLM inference by learning from live traffic in real time. ATLAS combines a static speculator, which provides robust baseline performance, with a lightweight adaptive speculator that rapidly specializes to emerging workload patterns. A confidence-aware controller dynamically selects between the two speculators and adjusts the lookahead length to optimize acceptance rates. On DeepSeek-V3.1, ATLAS reaches up to 500 tokens per second (TPS), 3.18x faster than standard decoding. In RL training scenarios, acceptance rates improve from 10% to 80%, reducing training time by over 60%. Unlike static speculators, which degrade as workloads evolve, ATLAS continuously adapts to maintain peak performance.
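The abstract does not describe how the controller is implemented. As a rough illustration only, the idea of choosing between two speculators and tuning the lookahead could be sketched as below. Everything here is an assumption made for the example rather than part of ATLAS: the names `SpeculatorStats`, `Controller`, and `expected_tokens`, the moving-average update, the `draft_cost` parameter, and the simplification that each drafted token is accepted independently with the same probability.

```python
from dataclasses import dataclass
from typing import Tuple


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with lookahead k.

    Assumes each drafted token is accepted independently with probability
    alpha; the result includes the one token the target model always adds.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


@dataclass
class SpeculatorStats:
    """Running acceptance-rate estimate for one speculator (hypothetical)."""
    name: str
    acceptance: float
    decay: float = 0.9  # smoothing factor, an arbitrary illustrative value

    def update(self, accepted: int, drafted: int) -> None:
        # Exponential moving average over observed acceptance ratios.
        if drafted > 0:
            observed = accepted / drafted
            self.acceptance = (
                self.decay * self.acceptance + (1.0 - self.decay) * observed
            )


@dataclass
class Controller:
    """Toy controller: pick the speculator with the higher estimated
    acceptance rate, then the lookahead with the best tokens-per-cost."""
    static: SpeculatorStats
    adaptive: SpeculatorStats
    max_lookahead: int = 8
    draft_cost: float = 0.1  # assumed cost of one draft token vs. one target pass

    def choose(self) -> Tuple[str, int]:
        # Ties go to the static speculator because it is listed first.
        best = max((self.static, self.adaptive), key=lambda s: s.acceptance)
        lookahead = max(
            range(1, self.max_lookahead + 1),
            key=lambda k: expected_tokens(best.acceptance, k)
            / (1.0 + self.draft_cost * k),
        )
        return best.name, lookahead


if __name__ == "__main__":
    ctrl = Controller(
        static=SpeculatorStats("static", acceptance=0.6),
        adaptive=SpeculatorStats("adaptive", acceptance=0.1),
    )
    print(ctrl.choose())
    # Simulate the adaptive speculator improving on live traffic.
    for _ in range(30):
        ctrl.adaptive.update(accepted=8, drafted=10)
    print(ctrl.choose())
```

In this toy model the controller starts on the static speculator and switches to the adaptive one once its estimated acceptance rate overtakes the static estimate; a higher acceptance rate also justifies a longer lookahead, since more drafted tokens are likely to survive verification.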
