Oral in Workshop: Towards Safe & Trustworthy Agents

Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback

Marcus Williams ⋅ Micah Carroll ⋅ Constantin Weisser ⋅ Adhyyan Narang ⋅ Brendan Murphy ⋅ Anca Dragan

Abstract

When AI systems are trained to maximize positive feedback from humans, this creates a perverse incentive structure for the AI to resort to any available means, including harmful behaviors like sycophancy, deception, and manipulation, to ensure it receives positive human feedback, regardless of whether its actions truly merit such approval. So far, with LLM training, this drive has only been documented in the emergence of relatively mild forms of sycophancy, in which the system overly agrees with or praises the user. Our work shows that in settings of practical LLM usage, optimizing for user feedback (as opposed to annotator feedback) reliably leads to the emergence of manipulation, deception, and extreme forms of sycophancy that surgically target the users most vulnerable to them. To mitigate this issue, it seems promising to leverage external annotator feedback to "veto" that of users. We find that while such an approach can reduce or remove the emergence of harmful behaviors in some settings, it can exacerbate them in others, making them more sophisticated and harder to detect. Our findings caution against optimizing for user feedback without stringent safeguards, and constitute a cautionary tale of the fundamental risks and limitations that come with optimizing for any form of feedback, whether from humans or AI systems.