Poster in Workshop: MATH-AI: The 5th Workshop on Mathematical Reasoning and AI

Learning to Reason on Hard Problems with Privileged On-Policy Exploration

Yuxiao Qu ⋅ Amrith Setlur ⋅ Virginia Smith ⋅ Ruslan Salakhutdinov ⋅ Aviral Kumar

Abstract

Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods cannot make use of every problem in a training dataset. On hard problems, on-policy RL rarely produces even a single correct rollout, yielding no reward signal and therefore no learning. Moreover, mixing easy problems into the training set can be detrimental: on-policy RL may derive a larger signal from these problems and use it to sharpen its distribution, impairing its ability to solve harder problems reliably. One might attempt to address this by distilling human- or model-written solutions into models, but these traces are not only expensive and hard to write, they also serve as poor fine-tuning targets: although they reach correct outputs, such concise solution paths are extremely challenging to learn from. We introduce Privileged On-Policy Exploration (POPE), a framework that leverages already available solutions from humans or other models to obtain a learning signal on hard problems by using them as "privileged" information that guides exploration. Concretely, POPE augments hard prompts with a minimal solution prefix as guidance, enabling RL to obtain non-zero rewards when rolling out conditioned on this prefix. We show that this approach allows RL to acquire behaviors that transfer back to the original, unguided problems. This process expands the set of solvable problems and improves performance on challenging reasoning benchmarks.
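The prefix-augmentation step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the prompt template, the step-wise splitting of a reference solution, the rule "shortest prefix for which at least one rollout is correct" as a reading of "minimal", and all names (`build_guided_prompt`, `minimal_prefix_length`, `toy_policy`). The toy policy is a deterministic stand-in for sampling from an LLM.

```python
from typing import Callable, List, Optional

# Hypothetical interface: a policy maps a prompt to a final answer string.
Policy = Callable[[str], str]


def build_guided_prompt(prompt: str, steps: List[str], k: int) -> str:
    """Append the first k steps of a reference solution to the prompt.

    The template is an assumption; k == 0 returns the original problem.
    """
    if k == 0:
        return prompt
    prefix = "\n".join(steps[:k])
    return f"{prompt}\n\nPartial solution:\n{prefix}\n\nContinue from here."


def num_correct(policy: Policy, prompt: str, answer: str, n_rollouts: int) -> int:
    """Count rollouts whose final answer matches the reference answer."""
    return sum(policy(prompt) == answer for _ in range(n_rollouts))


def minimal_prefix_length(
    policy: Policy,
    prompt: str,
    steps: List[str],
    answer: str,
    n_rollouts: int = 8,
) -> Optional[int]:
    """Return the smallest k whose guided prompt yields a non-zero reward.

    Assumed selection rule: try prefixes from shortest to longest and stop
    at the first one with at least one correct rollout. Returns None if
    even the full solution does not help.
    """
    for k in range(len(steps) + 1):
        guided = build_guided_prompt(prompt, steps, k)
        if num_correct(policy, guided, answer, n_rollouts) > 0:
            return k
    return None


def toy_policy(prompt: str) -> str:
    """Stand-in for an LLM: succeeds only once the key equation is given."""
    return "4" if "2x + 3 = 11" in prompt else "unknown"


problem = "Twice a number plus three is eleven. What is the number?"
steps = [
    "Let x be the unknown number.",
    "Then 2x + 3 = 11.",
    "So 2x = 8 and x = 4.",
]

k = minimal_prefix_length(toy_policy, problem, steps, answer="4")
train_prompt = build_guided_prompt(problem, steps, k)  # used for RL rollouts
eval_prompt = problem  # transfer is measured on the unguided problem
```

In this sketch the guided prompt is used only to obtain rollouts with non-zero reward during training, while evaluation still uses the original prompt, mirroring the abstract's claim that learned behaviors should transfer back to unguided problems.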
