Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models

Deepak Narayanan · Keshav Santhanam · Peter Henderson · Rishi Bommasani · Tony Lee · Percy Liang

Great Hall & Hall B1+B2 (level 1) #2010
Tue 12 Dec 8:45 a.m. PST — 10:45 a.m. PST


Large language models (LLMs) are highly capable but also computationally expensive. Characterizing the fundamental tradeoff between inference efficiency and model capabilities is thus important, but requires an efficiency metric that is comparable across models from different providers. Unfortunately, raw runtimes measured through black-box APIs do not satisfy this property: model providers can implement software and hardware optimizations orthogonal to the model, and shared infrastructure introduces performance contention. We propose a new metric for inference efficiency, called idealized runtime, that puts models on equal footing as though they were served on uniform hardware and software without performance contention, and a cost model to efficiently estimate this metric for autoregressive Transformer models. We also propose variants of the idealized runtime that incorporate the number and type of accelerators needed to serve the model. Using these metrics, we compare ten LLMs developed in 2022 to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our code is open sourced at
