Scale-Dependent Elicitation of Reasoning in LLMs
Abstract
Frontier labs employ computationally intensive reinforcement learning pipelines to enhance base models' performance on mathematical and logical reasoning tasks. However, the extent to which these improvements represent newly learned capabilities versus the activation of latent abilities remains unclear. We address this question by demonstrating that both sample-efficient and parameter-efficient training methods successfully elicit reasoning capabilities, but only from sufficiently large base models. We find that finetuning with as few as 29 DeepSeek R1 reasoning traces is sufficient to recover substantial reasoning performance in the 32B-parameter version of Qwen2.5-Instruct, while the 1.5B and 7B versions see small or negative gains despite improvements in validation loss. Furthermore, we show that a rank-1 LoRA with just 0.03\% as many trainable parameters as full finetuning achieves substantial performance improvements on the 32B model, demonstrating that these gains are not simply a consequence of larger models having greater capacity to learn during finetuning. Our findings reveal that reasoning capabilities can be efficiently elicited from large models through minimal data and parameter updates, while smaller models fail to benefit from these same efficient methods. This stark difference in learning efficiency suggests that general reasoning capabilities may already be partially latent in larger models, with important implications for our understanding of how capabilities emerge with scale.
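For concreteness, the parameter-efficient setting can be sketched as follows. This is a minimal illustration using the Hugging Face PEFT library, not the paper's exact configuration; the checkpoint name, target modules, scaling factor, and dropout are assumptions made for the example.

```python
# Minimal sketch: attach a rank-1 LoRA adapter to a causal LM and compare the
# number of trainable adapter parameters against the full model's parameter
# count. Checkpoint, target modules, alpha, and dropout are assumed values,
# not the configuration reported in the paper.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",  # assumed checkpoint
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=1,                     # rank-1 update for each adapted weight matrix
    lora_alpha=16,           # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)

# Only the LoRA adapter weights require gradients; the base model is frozen.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")
```

Because each adapted matrix receives only a rank-1 update, the trainable fraction comes out to a tiny share of the full model, on the order of the 0.03\% figure cited above.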