M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
Abstract
Self-supervised reinforcement learning (RL) offers a promising avenue for enhancing the reasoning capabilities of Large Language Models (LLMs) without reliance on expensive human-annotated data. However, existing methods are prone to "policy collapse," a phenomenon in which training becomes unstable over extended runs, leading to a sharp degradation in both reward and task performance. We diagnose this instability and attribute it to the absence of a stable optimization target in self-rewarding systems. To address this, we introduce M-GRPO (Momentum-Anchored Group Relative Policy Optimization), which anchors training to a slowly evolving momentum model that provides a consistent and reliable training signal, stabilizing the generation of pseudo-labels for policy optimization. Our experiments on the MATH dataset demonstrate that M-GRPO effectively prevents policy collapse, maintaining a stable training reward and consistently high validation accuracy. The code is available at https://github.com/baibizhe/M_GRPO/tree/main.
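
To make the momentum-anchoring idea concrete, below is a minimal sketch assuming the anchor is an exponential moving average (EMA) of the policy weights, a standard construction for momentum models; the coefficient `m`, the function name `update_momentum_anchor`, and the toy modules are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a momentum-anchored update, assuming the anchor is an
# exponential moving average (EMA) of the policy weights. The coefficient `m`
# and the surrounding setup are illustrative assumptions.
import copy
import torch
import torch.nn as nn


@torch.no_grad()
def update_momentum_anchor(policy: nn.Module, anchor: nn.Module, m: float = 0.99) -> None:
    """Slowly move the anchor toward the policy: theta_a <- m * theta_a + (1 - m) * theta_p."""
    for p_anchor, p_policy in zip(anchor.parameters(), policy.parameters()):
        p_anchor.mul_(m).add_(p_policy.detach(), alpha=1.0 - m)


# Usage: the anchor starts as a frozen copy of the policy and is refreshed after
# each optimization step, so it drifts slowly and can supply a stable target
# for pseudo-label (reward) generation during self-supervised RL.
policy = nn.Linear(8, 8)          # stand-in for the LLM policy
anchor = copy.deepcopy(policy)    # momentum model, never updated by backprop
for p in anchor.parameters():
    p.requires_grad_(False)

# ... after each policy optimizer step:
update_momentum_anchor(policy, anchor, m=0.99)
```

Because the anchor changes only by a small fraction of the policy's movement at each step, the pseudo-labels it produces remain consistent even when the policy itself is updated aggressively, which is the stabilizing effect the abstract describes.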