M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
Abstract
Self-supervised reinforcement learning (RL) offers a promising avenue for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, existing methods are prone to "policy collapse," a phenomenon in which training becomes unstable over extended runs, leading to a sharp degradation in both reward and task performance. We diagnose this instability and attribute it to the absence of a stable optimization target in self-rewarding systems. To address this, we introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), which anchors training to a slowly evolving momentum model that provides a consistent and reliable training signal, stabilizing the generation of pseudo-labels for policy optimization. Our experiments on the MATH dataset demonstrate that M-GRPO effectively prevents policy collapse, maintaining a stable training reward and consistently high validation accuracy. The code is available at https://github.com/baibizhe/M_GRPO/tree/main.
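
To make the momentum-anchoring idea concrete, below is a minimal sketch assuming the anchor is an exponential moving average (EMA) of the policy weights, a standard construction for momentum models; the coefficient `m`, the function name `update_momentum_anchor`, and the toy modules are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a momentum-anchored update, assuming the anchor is an
# exponential moving average (EMA) of the policy weights. The coefficient `m`
# and the surrounding setup are illustrative assumptions.
import copy
import torch
import torch.nn as nn


@torch.no_grad()
def update_momentum_anchor(policy: nn.Module, anchor: nn.Module, m: float = 0.99) -> None:
    """Slowly move the anchor toward the policy: theta_a <- m * theta_a + (1 - m) * theta_p."""
    for p_anchor, p_policy in zip(anchor.parameters(), policy.parameters()):
        p_anchor.mul_(m).add_(p_policy.detach(), alpha=1.0 - m)


# Usage: the anchor starts as a frozen copy of the policy and is refreshed after
# each optimization step, so it drifts slowly and can supply a stable target
# for pseudo-label (reward) generation during self-supervised RL.
policy = nn.Linear(8, 8)          # stand-in for the LLM policy
anchor = copy.deepcopy(policy)    # momentum model, never updated by backprop
for p in anchor.parameters():
    p.requires_grad_(False)

# ... after each policy optimizer step:
update_momentum_anchor(policy, anchor, m=0.99)
```

Because the anchor changes only by a small fraction of the policy's movement at each step, the pseudo-labels it produces remain consistent even when the policy itself is updated aggressively, which is the stabilizing effect the abstract describes.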