Teaching Pretrained Language Models to Think Deeper with Retrofitted Recurrence
Sean McLeish · Leon Li · John Kirchenbauer · Dayal Singh Kalra · Brian Bartoldson · Bhavya Kailkhura · Avi Schwarzschild · Jonas Geiping · Micah Goldblum · Tom Goldstein
Abstract
Recent advances in depth-recurrent language models show that recurrence can decouple train-time compute and parameter count from test-time compute. In this work, we study how to convert existing pretrained non-recurrent language models into depth-recurrent models. We find that using a curriculum of recurrences to increase the effective depth of the model over the course of training preserves performance while reducing total computational cost. In our experiments on grade-school math, we observe that converting pretrained models to recurrent ones results in better performance at a given compute budget than simply post-training the original non-recurrent language model.
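To illustrate the core idea, the sketch below shows a shared block applied a variable number of times, with a curriculum that raises the recurrence count over training. This is a minimal illustration under assumed conventions, not the authors' implementation; the names `RecurrentBlock`, `recurrence_schedule`, and `recurrent_forward` are hypothetical.

```python
# Minimal sketch (not the paper's code) of depth recurrence with a curriculum:
# a single shared block is applied n times, and n grows over training so the
# effective depth increases while parameter count stays fixed.
import torch
import torch.nn as nn


class RecurrentBlock(nn.Module):
    """One transformer-style block reused across recurrent iterations."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x


def recurrence_schedule(step: int, total_steps: int, max_recurrences: int) -> int:
    """Linearly ramp the number of recurrent applications from 1 to the maximum.
    (Hypothetical schedule; the paper's curriculum may differ.)"""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(1, round(1 + frac * (max_recurrences - 1)))


def recurrent_forward(block: RecurrentBlock, x: torch.Tensor, n_recurrences: int) -> torch.Tensor:
    """Apply the shared block n_recurrences times, deepening effective depth."""
    for _ in range(n_recurrences):
        x = block(x)
    return x


if __name__ == "__main__":
    block = RecurrentBlock(d_model=64)
    x = torch.randn(2, 16, 64)  # (batch, sequence, hidden)
    for step in (0, 500, 1000):
        n = recurrence_schedule(step, total_steps=1000, max_recurrences=8)
        y = recurrent_forward(block, x, n)
        print(f"step {step}: {n} recurrences, output shape {tuple(y.shape)}")
```

At test time, the recurrence count can be raised further than it was during training, which is the sense in which test-time compute is decoupled from train-time compute and parameter count.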