

Poster

Accelerating Best-of-N via Speculative Rejection

Ruiqi Zhang · Momin Haider · Ming Yin · Jiahao Qiu · Mengdi Wang · Peter Bartlett · Andrea Zanette

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

The safe and effective deployment of Large Language Models (LLMs) often involves generating helpful and benign responses, producing easily comprehensible code, and crafting content with specific stylistic preferences. While different, these tasks share the common mathematical goal of generating responses from a language model with high scores according to a metric of interest. A popular and well-known decoding strategy for this purpose is the Best-of-N method. The method generates a pre-specified number of responses (N) for a prompt and then returns the highest-scoring response among them. While Best-of-N is both simple and effective, its reliance on generating and scoring multiple responses for every prompt incurs high inference costs. In this paper, we take a first step towards accelerating the Best-of-N algorithm by halting the generation of unpromising utterances, namely those that are unlikely to be returned by the algorithm upon completion. Focusing on the alignment problem, we show that this simple strategy yields substantial speedups for the Best-of-N algorithm with minimal performance degradation.
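
The idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the single checkpoint `check_after`, the `keep_fraction` rule, and the assumption that the score of a partial response is a usable signal are all hypothetical choices made for illustration. Each candidate is modeled as an iterator of tokens, and cost is counted as the number of tokens generated.

```python
import itertools
from typing import Callable, Iterator, List, Tuple


def best_of_n(streams: List[Iterator[str]],
              score: Callable[[str], float]) -> Tuple[str, int]:
    """Plain Best-of-N: finish every candidate, return the top-scoring one."""
    finished = [list(s) for s in streams]           # generate all N to completion
    tokens = sum(len(f) for f in finished)          # total generation cost
    best = max(("".join(f) for f in finished), key=score)
    return best, tokens


def best_of_n_early_halt(streams: List[Iterator[str]],
                         score: Callable[[str], float],
                         check_after: int,
                         keep_fraction: float) -> Tuple[str, int]:
    """Hypothetical variant: halt low-scoring candidates at one checkpoint."""
    # Generate only the first `check_after` tokens of every candidate.
    partials = [list(itertools.islice(s, check_after)) for s in streams]
    tokens = sum(len(p) for p in partials)

    # Rank candidates by the score of their partial text (an assumption:
    # the partial score is taken as a proxy for the final score).
    order = sorted(range(len(streams)),
                   key=lambda i: score("".join(partials[i])),
                   reverse=True)
    n_keep = max(1, int(len(streams) * keep_fraction))

    # Finish only the survivors; the rest are halted and never completed.
    finished = []
    for i in order[:n_keep]:
        rest = list(streams[i])
        tokens += len(rest)
        finished.append("".join(partials[i] + rest))
    return max(finished, key=score), tokens


def toy_streams() -> List[Iterator[str]]:
    """Four fixed toy candidates; 'g' stands for a good token, 'b' for a bad one."""
    return [iter("gggg"), iter("bbgb"), iter("gbbb"), iter("bbbb")]


def toy_score(text: str) -> float:
    """Toy stand-in for a reward model: counts good tokens."""
    return float(text.count("g"))
```

On this toy input, both variants return the same response, while the early-halting variant generates fewer tokens because half of the candidates are stopped at the checkpoint. Whether partial scores rank candidates reliably in practice depends on the reward model and the task; the sketch simply assumes they do.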
