Poster in Workshop: Towards Safe & Trustworthy Agents

AI Sandbagging: Language Models can Selectively Underperform on Evaluations

Teun van der Weij ⋅ Felix Hofstätter ⋅ Oliver Jaffe ⋅ Samuel Brown ⋅ Francis Ward

Abstract
