E1: Controlling the Effort of a Reasoning Model through Reinforcement Learning
Abstract
Current reasoning models can improve performance by increasing the amount of chain-of-thought computation before producing an answer, but few approaches exist for controlling the number of generated chain-of-thought tokens. Recent approaches require the user to specify the absolute number of desired tokens, which in turn requires knowing the difficulty of the problem beforehand. We propose a reinforcement-learning (RL) method that enables control over the ``effort'' -- the \emph{relative} number of tokens to apply to a query at inference time. Importantly, after training with our objective, at maximal effort the number of generated tokens adapts to the difficulty of the problem. We implement our objective by using an example- and effort-dependent reward function and augmenting the prompt with a variable effort parameter. The resulting objective is directly compatible with standard RL training approaches such as GRPO. After training, we observe that both performance and the number of generated tokens increase monotonically with effort. Moreover, we show across model scales that our approach enables an approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for our RL training.
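To make the training setup concrete, below is a minimal sketch of how an effort-conditioned prompt and an example- and effort-dependent reward could be computed for a GRPO-style group of rollouts. The prompt template, reward shape, and helper names (\texttt{augment\_prompt}, \texttt{effort\_reward}, \texttt{score\_group}, and the user-supplied \texttt{generate} and \texttt{grade} callables) are illustrative assumptions, not the exact formulation used in the paper.
\begin{verbatim}
import random

def augment_prompt(question: str, effort: float) -> str:
    # Hypothetical template: the effort level (a value in [0, 1]) is
    # prepended so the model can condition its chain-of-thought length on it.
    return f"Effort: {effort:.2f}\n{question}"

def effort_reward(is_correct: bool, num_tokens: int, effort: float,
                  max_tokens: int = 8192) -> float:
    # Hypothetical example- and effort-dependent reward: correct answers
    # earn reward 1, and a length penalty scaled by (1 - effort)
    # discourages long chains of thought at low effort levels.
    length_penalty = (1.0 - effort) * (num_tokens / max_tokens)
    return (1.0 if is_correct else 0.0) - length_penalty

def score_group(question, answer, generate, grade, group_size=8):
    # GRPO-style rollout group: sample one effort level for the example,
    # generate several completions, and score each with the reward above.
    effort = random.random()
    prompt = augment_prompt(question, effort)
    rewards = []
    for _ in range(group_size):
        completion = generate(prompt)        # model rollout (user-supplied)
        correct = grade(completion, answer)  # task-specific grader (user-supplied)
        rewards.append(effort_reward(correct, len(completion.split()), effort))
    return prompt, rewards
\end{verbatim}
The group rewards returned by \texttt{score\_group} would then be normalized and used as advantages in a standard GRPO update, exactly as with any other scalar reward.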