Poster in Workshop: Workshop on Multi-Turn Interactions in Large Language Models

Language Models Rate Their Own Actions As Safer

Dipika Khullar ⋅ Jack Hopkins ⋅ Rowan Wang ⋅ Fabien Roger

2025 Poster

Abstract

Large language models (LLMs) are increasingly used as evaluators of text quality, harmfulness, and safety, yet their reliability as self-judges remains unclear. We identify self-attribution bias: when models evaluate actions they believe they have just taken, they systematically rate the risks as lower than when they evaluate the same actions, with the same information, presented as the output of another model. For example, after being forced to click a phishing link, models rate this action as 20% less risky than when judging it in isolation. Evaluating 10 frontier LLMs across 4,500 samples spanning ethical dilemmas, factual questions, and computer-use scenarios, we show that this bias is robust across domains. AI developers should be careful about letting LLMs infer that they are rating their own actions.
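
As a rough illustration of the comparison described in the abstract, the sketch below computes the fractional drop in mean rated risk between a baseline condition and a self-attributed condition. It is one plausible reading of a figure such as "20% less risky", not the authors' metric: the function name `relative_risk_drop`, the 0-10 rating scale, and all ratings are hypothetical and made up for this example.

```python
from statistics import mean


def relative_risk_drop(self_attributed, baseline):
    """Fractional drop in mean rated risk when the action is self-attributed.

    Both arguments are sequences of numeric risk ratings for the same
    samples, one rating per sample in each condition (an assumption of
    this sketch, not a detail taken from the paper).
    """
    if len(self_attributed) != len(baseline):
        raise ValueError("expected one rating per condition for each sample")
    baseline_mean = mean(baseline)
    if baseline_mean == 0:
        raise ValueError("baseline mean risk is zero; relative drop is undefined")
    return (baseline_mean - mean(self_attributed)) / baseline_mean


# Made-up ratings on a hypothetical 0-10 risk scale, for illustration only.
baseline_ratings = [8.0, 6.0, 10.0, 8.0]
self_attributed_ratings = [6.0, 5.0, 8.0, 5.0]

drop = relative_risk_drop(self_attributed_ratings, baseline_ratings)
print(f"Relative drop in rated risk: {drop:.0%}")
```

A positive value means the self-attributed action was rated as less risky than the baseline; a value near zero would indicate no measurable bias under this particular definition.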
