Re-FORC: Adaptive Reward Forecasting for Efficient Chain-of-Thought Reasoning
Abstract
We propose Re-FORC, an adaptive reward forecasting method that, given a context, predicts the expected future reward as a function of the number of additional thinking tokens. Re-FORC trains a lightweight adapter on top of reasoning models and demonstrates improved forecasting with longer reasoning and larger models. Re-FORC enables: 1) \textit{early stopping} of unpromising reasoning chains, reducing compute by 26\% while maintaining accuracy; 2) \textit{optimized model and thinking-length selection}, achieving 4\% higher accuracy at equal compute and 55\% less compute at equal accuracy compared to the largest model; and 3) \textit{adaptive test-time scaling}, which increases accuracy by 11\% in the high-compute regime and by 7\% in the low-compute regime. In addition, Re-FORC supports dynamic control of reasoning length via cost-per-token thresholds while estimating computation time upfront.
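To make the mechanism concrete, the following is a minimal PyTorch sketch of the core idea: a small adapter head forecasts expected reward as a function of the number of additional thinking tokens, and a cost-per-token threshold decides whether to keep reasoning. The class names, shapes, pooled-context input, and stopping rule here are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
# Minimal illustrative sketch (NOT the authors' implementation): a lightweight
# adapter that forecasts expected reward given a context representation and a
# candidate budget of additional thinking tokens, plus a cost-per-token
# stopping rule. All names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class RewardForecastAdapter(nn.Module):
    """Maps (context embedding, future-token budget) to an expected reward in [0, 1]."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, context_emb: torch.Tensor, n_future_tokens: torch.Tensor) -> torch.Tensor:
        # Append a log-scaled token budget to the context embedding.
        budget = torch.log1p(n_future_tokens.float()).unsqueeze(-1)
        return torch.sigmoid(self.mlp(torch.cat([context_emb, budget], dim=-1))).squeeze(-1)

def should_continue(adapter, context_emb, tokens_so_far, step=256, cost_per_token=1e-4):
    """Continue thinking only if the forecast reward gain from `step` more
    tokens exceeds their cost (early stopping / dynamic length control)."""
    now = torch.tensor([float(tokens_so_far)])
    later = torch.tensor([float(tokens_so_far + step)])
    gain = adapter(context_emb, later) - adapter(context_emb, now)
    return (gain > cost_per_token * step).item()

# Usage: decide whether 256 more thinking tokens are worth their cost.
adapter = RewardForecastAdapter()
ctx = torch.randn(1, 768)  # stand-in for a pooled hidden state of the reasoning model
print(should_continue(adapter, ctx, tokens_so_far=512))
\end{verbatim}

The same forecast-versus-cost comparison also yields an upfront compute estimate: scanning candidate budgets and stopping where the marginal forecast gain drops below the per-token cost gives a predicted thinking length before generation begins.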