

Poster

Exploring & Improving Multi-token Prediction (Block Draft) in Language Modeling

Taehyeon Kim · Ananda Theertha Suresh · Kishore Papineni · Michael D Riley · Sanjiv Kumar · Adrian Benton

Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Despite the remarkable strides made by autoregressive language models, their potential is often hampered by the slow inference speeds inherent in sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. as a method to improve the inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified and selectively accepted by the autoregressive model. Block drafts are generated by the multiple independent prediction heads of blockwise parallel language models. This paper contributes to the understanding and improvement of block drafts in two ways. First, we analyze the token distributions produced by the multiple prediction heads. Second, we leverage this analysis to develop algorithms that improve BPD inference speed by refining the block drafts with n-gram and neural language models. Experiments demonstrate that refined block drafts yield a 5–21% increase in block efficiency (i.e., the number of accepted tokens from the block draft) across diverse datasets.
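The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `base_predict` stands in for one greedy step of the autoregressive model, and the accepted span is the longest prefix of the draft that the base model would itself have generated.

```python
def accept_block_draft(draft, base_predict, prefix):
    """Greedy BPD-style verification (illustrative sketch).

    draft        -- candidate tokens from the parallel prediction heads
    base_predict -- callable mapping a context (list of tokens) to the
                    base model's greedy next token (hypothetical helper)
    prefix       -- tokens generated so far

    Returns the longest prefix of `draft` that matches the base model's
    own greedy continuation; the number of accepted tokens per step is
    the "block efficiency" the abstract refers to.
    """
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        if base_predict(ctx) != tok:
            break  # first mismatch: reject the rest of the draft
        accepted.append(tok)
        ctx.append(tok)
    return accepted
```

In the actual scheme the base model scores all draft positions in a single parallel forward pass rather than one call per token; the sequential loop here only makes the acceptance rule explicit.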
