Poster
On the Inductive Bias of Stacking Towards Improving Reasoning
Nikunj Saunshi · Stefani Karp · Shankar Krishnan · Sobhan Miryoosefi · Sashank Jakkam Reddi · Sanjiv Kumar
East Exhibit Hall A-C #1900
Given the increasing scale of model sizes, novel training strategies like gradual stacking have garnered interest. Stacking enables efficient training by gradually growing the depth of a model in stages and using layers from a smaller model in an earlier stage to initialize the next stage. Although such growing approaches are efficient for training, the model biases they induce remain largely unexplored. In this work, we examine this fundamental aspect of gradual stacking, going beyond its efficiency benefits. We propose a variant of gradual stacking called MIDAS and discover an intriguing phenomenon for this approach: MIDAS is not only training-efficient but, surprisingly, also has an inductive bias towards improving downstream tasks, especially tasks that require reasoning abilities, despite having similar or slightly worse perplexity compared to baseline training. To further analyze this inductive bias, we construct reasoning primitives – simple synthetic tasks that are building blocks for reasoning – and find that a model pretrained with stacking is significantly better on these primitives than one pretrained in the standard way, both with and without fine-tuning. This provides stronger and more robust evidence for this inductive bias towards reasoning. Furthermore, we conjecture an underlying explanation for this inductive bias by exploring the connection between stacking and looped models, and provide strong supporting empirical analysis.
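The sketch below illustrates the general idea of gradual stacking described in the abstract: train a shallow model, then initialize a deeper model by reusing its layers, and repeat over stages. It is a minimal illustration under assumed details (the depth schedule, the cyclic layer-copy scheme, and the placeholder train function are all hypothetical), not the exact MIDAS procedure from the paper.

import copy

def grow_by_stacking(layers, target_depth):
    """Initialize a deeper model by reusing layers from the smaller model
    trained in the previous stage (one possible copy scheme; the paper's
    exact scheme may differ)."""
    grown = [copy.deepcopy(layer) for layer in layers]
    i = 0
    while len(grown) < target_depth:
        # Repeat existing layers until the target depth is reached.
        grown.append(copy.deepcopy(layers[i % len(layers)]))
        i += 1
    return grown

def train(layers, steps):
    """Placeholder for training the current-depth model for `steps` steps."""
    # In a real setup this would run optimizer updates on the language model.
    return layers

if __name__ == "__main__":
    # Hypothetical stage schedule: depth grows 6 -> 12 -> 24 across stages.
    depth_schedule = [6, 12, 24]
    model = [{"name": f"layer_{i}"} for i in range(depth_schedule[0])]
    model = train(model, steps=10_000)

    for depth in depth_schedule[1:]:
        model = grow_by_stacking(model, depth)  # init next stage from smaller model
        model = train(model, steps=10_000)

    print(f"final depth: {len(model)}")

Here the layers are stand-in dicts rather than real transformer blocks; the point is only the stage-wise growth and reuse of earlier-stage layers for initialization.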