DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification
Abstract
Speculative decoding accelerates LLM inference by letting a small draft model propose multiple tokens that a larger target model verifies in parallel, but rigid verification that enforces an exact distributional match rejects many plausible tokens and limits the speedup. We first introduce Static Ensemble, a training‑free, fixed‑weight mixture of the draft and target distributions that provably traces the Pareto‑optimal trade‑off between rejection probability and distributional bias. To further raise acceptance without sacrificing quality, we propose Diversed (DynamIc VErification Relaxed SpEculative Decoding), which learns context‑dependent mixing weights to form a flexible verification target. This relaxed verification admits safe tokens more often while preserving correctness. Theory and experiments show that Diversed achieves significantly higher inference efficiency than both conventional speculative decoding and the static baseline.
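To make the relaxed verification rule concrete, below is a minimal NumPy sketch of the Static Ensemble case under one plausible reading of the abstract: the standard speculative-sampling acceptance test is run against the mixture q = (1 − λ)·p_target + λ·p_draft rather than against the target distribution alone. All names, array shapes, and the fixed scalar `lam` are illustrative assumptions, not the paper's implementation; Diversed would instead predict the mixing weight per context.

```python
import numpy as np

def relaxed_verify(draft_probs, target_probs, tokens, lam, rng):
    """Verify draft tokens against q = (1 - lam) * p_target + lam * p_draft.

    lam = 0.0 recovers exact (lossless) speculative decoding; larger lam
    accepts more draft tokens at the cost of distributional bias. A fixed
    scalar lam is the Static Ensemble setting; Diversed would predict it
    from context. Shapes: draft_probs, target_probs are (k, vocab);
    tokens is a length-k sequence sampled from the draft model.
    """
    out = []
    for i, tok in enumerate(tokens):
        p = draft_probs[i]                           # draft distribution at step i
        q = (1.0 - lam) * target_probs[i] + lam * p  # relaxed verification target
        # Standard acceptance test, run against q instead of the target alone.
        if rng.random() < min(1.0, q[tok] / p[tok]):
            out.append(int(tok))
        else:
            # On rejection, resample from the normalized residual max(q - p, 0)
            # and discard the remaining draft tokens, as in vanilla speculation.
            residual = np.maximum(q - p, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            break
    return out


# Toy usage: verify 3 draft tokens over a 5-token vocabulary.
rng = np.random.default_rng(0)
k, vocab = 3, 5
draft = rng.dirichlet(np.ones(vocab), size=k)
target = rng.dirichlet(np.ones(vocab), size=k)
tokens = [int(rng.choice(vocab, p=draft[i])) for i in range(k)]
print(relaxed_verify(draft, target, tokens, lam=0.3, rng=rng))
```

Because q contains p_draft with weight λ, the acceptance ratio q(t)/p(t) is bounded below by λ, which is one way to see how mixing raises the acceptance rate while λ = 0 keeps verification exact.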