Aligning to Thousands of Varying Preferences via System Message Generalization
Abstract
Current large language model (LLM) alignment methods often assume that aligning LLMs with the general public's preferences is optimal, overlooking the diversity of individual values. A major challenge in adopting a more individualized approach to LLM alignment is its lack of scalability, as it requires re-training a new model for each new value or user. We propose a new paradigm in which users specify their values within the system message, steering LLM behavior to align with individual intentions. However, LLMs are typically trained on a generic system message (e.g., "You are a helpful assistant"). To improve generalization to diverse system messages, we create a system message dataset with 197k value combinations across 66k user instructions. We train a 7B LLM, Janus, and test it on 921 prompts from 5 benchmarks, adding various unseen system messages that reflect user preferences. Janus achieves high tie+win rates against leading models, including GPT-4. Unexpectedly, Janus also outperforms LLaMA 3 8B Instruct on general helpfulness benchmarks, suggesting that training with diverse system messages enhances alignment with both individual and general preferences. Code, dataset, benchmark, and models are available at https://anonymous.4open.science/r/janus.
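To make the system-message-as-preference-specification idea concrete, the following is a minimal sketch, assuming an OpenAI-style chat-messages format; the value dimensions and wording shown are illustrative assumptions, not entries from the paper's dataset.

```python
# Minimal sketch of specifying user values in the system message.
# The value phrases below are hypothetical examples, not items from the
# paper's 197k value combinations; any chat model that accepts an
# OpenAI-style "messages" list (system/user roles) could consume prompts
# built this way.

def build_prompt(user_values: list[str], instruction: str) -> list[dict]:
    """Compose a chat prompt whose system message encodes user-specific values."""
    system_message = (
        "You are a helpful assistant. When responding, prioritize the "
        "following user preferences: " + "; ".join(user_values) + "."
    )
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": instruction},
    ]

if __name__ == "__main__":
    instruction = "Explain how vaccines work."
    # Two users with different (hypothetical) value combinations receive
    # differently steered prompts for the same instruction.
    concise_user = ["concise answers", "plain language", "everyday examples"]
    detailed_user = ["step-by-step depth", "mechanistic detail", "technical vocabulary"]

    for values in (concise_user, detailed_user):
        for msg in build_prompt(values, instruction):
            print(f"[{msg['role']}] {msg['content']}")
        print("-" * 40)
```

Under this scheme, adapting the model to a new user reduces to writing a new system message rather than re-training, which is the scalability gain the abstract highlights.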