Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Aidan Ewart ⋅ Abhay Sheshadri ⋅ Phillip Guo ⋅ Aengus Lynch ⋅ Cindy Wu ⋅ Vivek Hebbar ⋅ Henry Sleight ⋅ Asa Cooper Stickland ⋅ Ethan Perez ⋅ Dylan Hadfield-Menell ⋅ Stephen Casper

Abstract
