Expo Talk Panel
Room 206 - 207

Large language models have achieved impressive results and are now frequently deployed in production. As a result, serving these models has become increasingly costly relative to training, making inference performance optimization a ripe area for research. This talk will develop a first-principles framework for reasoning about large language model inference arithmetic, covering the key performance metrics to track and methods for estimating inference latency. It will then use this framework to analyze promising directions for future inference research.
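
To illustrate the kind of estimate such a framework produces, here is a minimal sketch of a roofline-style bound on per-token decode latency: latency is taken as the maximum of a memory-bandwidth bound (every weight is read once per forward pass) and a compute bound (~2 FLOPs per parameter per token). The hardware numbers and model size below are illustrative assumptions, not figures from the talk.

```python
def per_token_latency_s(
    n_params: float,           # total model parameters
    bytes_per_param: float,    # 2 for fp16/bf16 weights
    mem_bandwidth_bps: float,  # accelerator memory bandwidth (bytes/s)
    flops: float,              # accelerator throughput (FLOP/s)
    batch_size: int = 1,
) -> float:
    """Estimate per-token decode latency as the max of two bounds:
    memory-bound (weights streamed once per forward pass) and
    compute-bound (~2 FLOPs per parameter per token per sequence)."""
    memory_bound = n_params * bytes_per_param / mem_bandwidth_bps
    compute_bound = 2 * n_params * batch_size / flops
    return max(memory_bound, compute_bound)

# Example (assumed numbers): a 13B-parameter fp16 model on a single
# A100-80GB-class accelerator (~2.0e12 B/s HBM bandwidth, ~3.12e14 fp16 FLOP/s).
latency = per_token_latency_s(13e9, 2, 2.0e12, 3.12e14)
print(f"~{latency * 1e3:.1f} ms/token at batch size 1")  # ~13 ms, memory-bound
```

At batch size 1 the memory bound dominates by orders of magnitude, which is why small-batch decoding is typically bandwidth-limited rather than compute-limited.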
