Stability of Preference Alignment for Multi-Turn Control with LLM Policies
Abstract
Large language models (LLMs) are increasingly deployed in multi-turn control settings, such as interface navigation and robot manipulation, where stability over long horizons is critical. In this work, we present a study of preference alignment methods, including group-relative policy optimization (GRPO), direct preference optimization (DPO), contrastive preference optimization (CPO), and a GRPO variant with behavior cloning regularization, in two domains: a tokenized gridworld and a shared-control racing task that requires long-horizon planning and interaction. Rather than proposing a new algorithm, our goal is to analyze stability trade-offs and clarify when existing approaches succeed or fail. We show that (1) contrastive methods such as DPO and CPO risk policy degradation when valid negative examples are unavailable, (2) such methods struggle to recover multi-modal behaviors from a pre-trained initialization, and (3) adding behavior cloning regularization to GRPO improves robustness in some multi-turn settings. Together, our findings provide practical guidance for applying alignment techniques to long-horizon interactive policies and highlight open challenges for stable, preference-aware LLM control.