MetroRL: Enabling Memory‑Effective Training for On‑Policy RLHF via Adaptive Sequence Streaming
Wei Cui
Abstract
Reinforcement learning from human feedback (RLHF) has become the standard post‑training technique for endowing large language models (LLMs) with helpful, harmless, and intent‑consistent behavior. In practice, however, its adoption is hampered by the prohibitive memory consumption of the policy‑model update phase, especially when training on long‑form generation tasks. In this paper, we propose MetroRL, a memory‑efficient, on‑policy RLHF approach that exploits inference‑time computations to reduce the training‑time memory budget and to skip unnecessary work. By reusing the $K, V$ context materialized during the inference phase, MetroRL removes, at no extra cost, the inter‑token dependencies that normally force the entire sequence to be trained in parallel. Building on fine‑grained subsequence streaming, RLHF can then train on the productive tokens efficiently. The result is a training pipeline that matches the exact behavior of conventional full‑sequence RLHF while using less memory and incurring no arithmetic recomputation. Experiments on the Qwen‑3 models demonstrate that the rescheduled MetroRL algorithm reduces peak training memory usage by a factor of 3.8 to 5.9, enabling fine‑tuning of LLMs that is not only memory‑efficient but also semantically reliable.
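To make the streaming idea concrete, the following is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: a toy one‑layer causal model (`TinyCausalLM`) whose policy update is computed chunk by chunk over the generated response, reusing a detached key/value cache for the already‑processed prefix so that only one chunk's activations are alive at a time. The model class, the chunk size, and the REINFORCE‑style surrogate loss are illustrative assumptions; note also that detaching the cache discards gradient paths through earlier tokens' keys and values, a simplification that MetroRL's exact‑equivalence claim would have to handle differently.

```python
# A minimal, hypothetical sketch of KV-cache reuse plus chunked ("streamed")
# policy updates. TinyCausalLM is a toy single-layer attention model
# (positional encodings, layer norm, and MLP omitted for brevity).
import torch
import torch.nn.functional as F


class TinyCausalLM(torch.nn.Module):
    def __init__(self, vocab=100, d=64, n_heads=4):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, d)
        self.qkv = torch.nn.Linear(d, 3 * d)
        self.out = torch.nn.Linear(d, vocab)
        self.n_heads, self.d = n_heads, d

    def forward(self, ids, kv_cache=None):
        # ids: (B, T) chunk of token ids; kv_cache: optional (k, v), each (B, H, T_past, d_h)
        B, T = ids.shape
        h = self.embed(ids)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        split = lambda x: x.view(B, T, self.n_heads, -1).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        if kv_cache is not None:                      # prepend cached keys/values
            k = torch.cat([kv_cache[0], k], dim=2)
            v = torch.cat([kv_cache[1], v], dim=2)
        T_past, S = k.shape[2] - T, k.shape[2]
        # query at global position T_past + i may attend to keys 0 .. T_past + i
        mask = (torch.arange(S, device=ids.device)[None, :]
                <= (T_past + torch.arange(T, device=ids.device))[:, None])
        a = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        logits = self.out(a.transpose(1, 2).reshape(B, T, self.d))
        return logits, (k, v)                         # (k, v) now also cover this chunk


def streamed_policy_update(model, prompt_ids, response_ids, advantages, chunk=8):
    """REINFORCE-style surrogate computed chunk by chunk over the response.

    Each chunk's backward pass only needs that chunk's activations plus the
    detached KV cache, so peak activation memory scales with `chunk` rather
    than the full response length. Detaching the cache drops gradient paths
    through earlier tokens' keys/values -- an illustrative simplification.
    """
    full = torch.cat([prompt_ids, response_ids], dim=1)
    P = prompt_ids.shape[1]
    with torch.no_grad():                             # rollout-style prefill, no training graph
        _, cache = model(full[:, :P - 1])             # cache covers prompt tokens 0 .. P-2
    for s in range(0, response_ids.shape[1], chunk):
        piece = response_ids[:, s:s + chunk]          # response tokens to score
        adv = advantages[:, s:s + chunk]
        inp = full[:, P - 1 + s: P - 1 + s + piece.shape[1]]  # inputs whose logits predict `piece`
        logits, cache = model(inp, kv_cache=cache)
        logp = torch.log_softmax(logits, dim=-1).gather(-1, piece.unsqueeze(-1)).squeeze(-1)
        (-(adv * logp).mean()).backward()             # gradients accumulate across chunks
        cache = tuple(c.detach() for c in cache)      # free this chunk's graph before the next one


# Toy usage: random "prompt", "response", and per-token advantages.
torch.manual_seed(0)
model = TinyCausalLM()
prompt = torch.randint(0, 100, (2, 16))
response = torch.randint(0, 100, (2, 48))
streamed_policy_update(model, prompt, response, torch.randn(2, 48), chunk=8)
# model parameters now hold accumulated gradients; an optimizer step would follow.
```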