Efficient Reinforcement Learning for Optimizing Multi-turn Student Outcomes with LLM Tutors
Abstract
Large language models (LLMs) built on existing reinforcement learning from human feedback (RLHF) frameworks typically optimize the immediate response at each turn. However, this can fail in multi-turn dialogue settings such as online math tutoring, where a single-turn-optimal tutor may simply give away answers instead of guiding the student step by step. We introduce a method that enhances LLM-based tutors by representing the dialogue history with a lower-dimensional (student) state representation and optimizing a long-term policy that selects high-level actions given that state. This better aligns the tutor with the long-term objective of helping the student solve the target math problem(s) independently. Because it operates on lower-dimensional states and high-level actions, our approach is more computationally efficient than training the tutor policy end-to-end to directly generate the tutor's response. In LLM-simulated tutoring scenarios evaluated on GSM8K, our approach improves students' long-term outcomes by 50% compared to prompting baselines.
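As a rough illustration of the idea summarized above (not the paper's implementation), the Python sketch below shows a long-horizon policy that maps a hand-crafted, lower-dimensional student state to a high-level tutoring action, leaving the actual response wording to the LLM. The state features, the action set, and the tabular Q-learning update are illustrative assumptions chosen to keep the example self-contained.

from dataclasses import dataclass
import random

@dataclass
class StudentState:
    # Low-dimensional summary of the dialogue history rather than the raw text.
    steps_completed: int       # solution steps the student has finished so far
    last_answer_correct: bool  # whether the student's last attempt was correct
    hints_given: int           # hints the tutor has already provided

# Hypothetical high-level tutoring actions; the LLM verbalizes the chosen action.
HIGH_LEVEL_ACTIONS = ["ask_guiding_question", "give_hint", "confirm_step", "encourage"]

def featurize(state: StudentState) -> tuple:
    # Hashable feature vector used as the table key.
    return (state.steps_completed, int(state.last_answer_correct), state.hints_given)

class TabularQPolicy:
    """Long-horizon policy over high-level actions; because the state and action
    spaces are small, learning is far cheaper than end-to-end token generation."""

    def __init__(self, epsilon: float = 0.1, lr: float = 0.1, gamma: float = 0.95):
        self.q = {}  # maps (features, action) -> estimated long-term value
        self.epsilon, self.lr, self.gamma = epsilon, lr, gamma

    def act(self, state: StudentState) -> str:
        # Epsilon-greedy action selection over the high-level action set.
        feats = featurize(state)
        if random.random() < self.epsilon:
            return random.choice(HIGH_LEVEL_ACTIONS)
        return max(HIGH_LEVEL_ACTIONS, key=lambda a: self.q.get((feats, a), 0.0))

    def update(self, state, action, reward, next_state):
        # One-step Q-learning update toward the long-term tutoring reward
        # (e.g., whether the student eventually solves the problem on their own).
        feats, next_feats = featurize(state), featurize(next_state)
        best_next = max(self.q.get((next_feats, a), 0.0) for a in HIGH_LEVEL_ACTIONS)
        old = self.q.get((feats, action), 0.0)
        self.q[(feats, action)] = old + self.lr * (reward + self.gamma * best_next - old)

In this sketch the dialogue itself never enters the RL loop: a separate (assumed) component would summarize the conversation into a StudentState, the policy picks an action such as "give_hint", and the LLM turns that action into the tutor's next message.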