Poster
in
Workshop: Safe Generative AI

Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment

Karel Doosterlinck ⋅ Winnie Xu ⋅ Chris Develder ⋅ Thomas Demeester ⋅ Amanpreet Singh ⋅ Christopher Potts ⋅ Douwe Kiela ⋅ Shikib Mehri

Project Page [ Poster] [ OpenReview]

Abstract

Alignment of Large Language Models (LLMs) is crucial for ensuring their safety, particularly in preventing unintended behaviors and harmful outputs. However, aligning models using preference pair datasets does not always guarantee successful results, as models can accidentally be optimized for superficial cues in the data rather than genuinely desirable behaviors, an issue often referred to as reward hacking. We study the core principles of alignment and find that (i) preference data gives a more robust learning signal when the underlying responses are contrastive, and (ii) alignment objectives lead to more robust optimization when they specify more control over the model during training. Based on these insights, we introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs, and Anchored Preference Optimization (APO), a controllable and more stable alignment objective. Both our methods are designed to give AI practitioners precise control over how their model should change during alignment training, allowing them to build safer and more precise models. We align Llama-3-8b-Instruct using various comparable datasets and alignment objectives and measure MixEval-Hard scores, which correlate highly with human-produced rankings of models. The CLAIR preferences lead to the strongest performance out of all datasets, and APO consistently outperforms less controllable objectives. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8b-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%. The strong results of both our methods indicate their ability to precisely control what a model learns during alignment, mitigating reward hacking. Additionally, our experiments highlight how alignment can accidentally deteriorate model performance, inadvertently introducing safety risks.

Chat is not available.