A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
Abstract
We study how to train large language models (LLMs) as autonomous agents that act over multiple turns in agentic environments. While reinforcement learning has driven strong single-turn reasoning, extending it to multi-turn environments introduces new challenges that remain largely unaddressed. We formulate multi-turn agentic RL with dense per-turn rewards and token-level credit assignment, and provide a systematic analysis of the impact of three RL pillars (environment, policy, and reward) on multi-turn RL. In interactive text environments (TextWorld, ALFWorld), we examine scaling with environment complexity and generalization across tasks; we analyze the role of model priors in subsequent multi-turn RL training; and we compare the effects of sparse and dense per-turn rewards on learning. We also provide an extensible code framework for multi-turn agentic RL. Together, our formulation, analysis, and toolkit offer practical guidance for building LLM agents capable of robust multi-turn decision making in agentic environments.
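To make the notions of dense per-turn rewards and token-level credit assignment concrete, the sketch below shows one common way they can be wired together; it is an illustrative assumption, not the paper's exact formulation. The function name `token_level_returns`, the turn-level discount `gamma`, and the broadcast of each turn's return to its generated tokens are all hypothetical choices for this example.

```python
# Illustrative sketch (assumed, not the paper's exact method): turn dense
# per-turn rewards into a token-level credit signal by (1) computing
# discounted return-to-go over turns, then (2) broadcasting each turn's
# return to the tokens generated in that turn.
from typing import List


def token_level_returns(
    turn_rewards: List[float],   # one scalar reward per agent turn (dense)
    tokens_per_turn: List[int],  # number of generated tokens in each turn
    gamma: float = 0.99,         # discount factor across turns (assumed)
) -> List[float]:
    """Return one scalar return per generated token, aligned turn by turn."""
    assert len(turn_rewards) == len(tokens_per_turn)

    # Discounted return-to-go at the turn level.
    turn_returns = [0.0] * len(turn_rewards)
    running = 0.0
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        turn_returns[t] = running

    # Broadcast each turn's return to every token produced in that turn.
    returns: List[float] = []
    for ret, n_tok in zip(turn_returns, tokens_per_turn):
        returns.extend([ret] * n_tok)
    return returns


# Example: 3 turns with rewards 0.0, 0.5, 1.0 and 4, 6, 3 generated tokens.
if __name__ == "__main__":
    print(token_level_returns([0.0, 0.5, 1.0], [4, 6, 3]))
```

The resulting per-token signal can then be plugged into a standard policy-gradient loss in place of a single trajectory-level return, which is what distinguishes this setup from sparse, episode-final rewards.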