

M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Bizhe Bai ⋅ Hongming Wu ⋅ Peng Ye ⋅ Tao Chen

Abstract
