DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Abstract
We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster (fast) via reinforcement learning (RL). We first generalize existing baseline approaches such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihoods are easy to estimate as approximations to the true dLLM policy. This naturally motivates improved two-stage likelihood approximations combined with importance sampling correction, which lead to RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of training efficient samplers/controllers of the dLLM policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt, which yields even better accuracies with a lower number of function evaluations (NFEs) compared to the base model. Finally, we consider jointly training the dLLM policy and the sampler to obtain the best performance in improving the Pareto frontier of inference-time compute for dLLMs. We showcase the effectiveness of our pipeline by training open-source large diffusion language models on benchmark math and planning tasks.
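To make the off-policy surrogate idea concrete, a minimal sketch (using our own illustrative notation, not necessarily DiFFPO's exact two-stage objective) is a clipped, importance-weighted policy-gradient objective in which the hard-to-compute dLLM sequence likelihood is replaced by a tractable surrogate $\tilde{\pi}_\theta$; here $q$ is a prompt, $o$ a sampled completion, $\hat{A}$ an advantage estimate, and $\epsilon$ a clipping radius:
\[
  J(\theta) \;=\;
  \mathbb{E}_{q,\; o \sim \tilde{\pi}_{\theta_{\mathrm{old}}}(\cdot \mid q)}
  \Big[
    \min\!\Big(
      r_\theta(o, q)\,\hat{A}(o, q),\;
      \operatorname{clip}\big(r_\theta(o, q),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}(o, q)
    \Big)
  \Big],
  \qquad
  r_\theta(o, q) \;=\; \frac{\tilde{\pi}_\theta(o \mid q)}{\tilde{\pi}_{\theta_{\mathrm{old}}}(o \mid q)} .
\]
The ratio $r_\theta$ plays the role of the importance-sampling correction for training on completions drawn under the old surrogate policy; the paper's specific choice of surrogate likelihood and correction is described in the main text.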