Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal ⋅ Ramneet Kaur ⋅ Colin Samplawski ⋅ Manoj Acharya ⋅ Anirban Roy ⋅ Daniel Elenius ⋅ Brian Matejek ⋅ Adam Cobb ⋅ Susmit Jha

Abstract
