Databricks

Expo Talk Panel

Optimizing and Reasoning about LLM Inference: from First Principles to SOTA Techniques

Linden Li


Abstract:

Large language models have achieved impressive results and are now frequently deployed in production settings. As a result, serving these models has become increasingly costly relative to training, making performance optimization a ripe area for research. This talk will develop a first-principles approach to reasoning about the arithmetic of large language model inference, covering key performance metrics and methods for estimating inference latency. It will then use this framework to analyze promising directions for future inference research.
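To make the "inference arithmetic" concrete, below is a minimal sketch of the kind of roofline-style latency estimate the abstract alludes to. Everything here is an illustrative assumption rather than material from the talk: a dense decoder-only model at batch size 1, weights resident in HBM, and approximate A100-class hardware numbers.

```python
# A minimal sketch of a first-principles decode-latency estimate.
# Assumptions (not from the talk): dense decoder-only model, batch size 1,
# weights streamed from HBM once per token, KV-cache traffic ignored,
# and illustrative A100-class hardware numbers.

def decode_latency_per_token(
    n_params: float,                 # number of model parameters
    bytes_per_param: float = 2.0,    # e.g. bf16/fp16 weights (assumed)
    mem_bw_bytes_s: float = 1.5e12,  # ~1.5 TB/s HBM bandwidth (assumed)
    flops_s: float = 3.12e14,        # ~312 TFLOPS dense bf16 (assumed)
) -> float:
    """Roofline-style estimate: each decoded token reads every weight once
    (memory-bound term) and performs ~2 FLOPs per parameter (compute-bound
    term). Achievable latency is bounded by the slower of the two."""
    mem_time = n_params * bytes_per_param / mem_bw_bytes_s
    compute_time = 2 * n_params / flops_s
    return max(mem_time, compute_time)


if __name__ == "__main__":
    # Example: a hypothetical 13B-parameter model in bf16 on the assumed
    # hardware. Prints roughly 17 ms/token, i.e. ~58 tokens/s.
    t = decode_latency_per_token(13e9)
    print(f"~{t * 1e3:.1f} ms/token -> ~{1 / t:.0f} tokens/s")
```

Note that under these assumptions the memory term dominates by two orders of magnitude: single-stream decoding is memory-bandwidth bound, which is why estimates of this form start from bytes moved rather than FLOPs.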
