Jailbreak Defense in a Narrow Domain: Failures of Existing Methods and Improving Transcript-Based Classifiers
Tony Wang ⋅ John Hughes ⋅ Henry Sleight ⋅ Rylan Schaeffer ⋅ Rajashree Agrawal ⋅ Fazl Barez ⋅ Mrinank Sharma ⋅ Jesse Mu ⋅ Nir Shavit ⋅ Ethan Perez
Abstract
Defending large language models against jailbreaks so that they never engage in a broad set of forbidden behaviors is an open problem. In this paper, we study whether jailbreak defense is more tractable if one only needs to forbid a very narrow set of behaviors. In particular, we focus on preventing an LLM from helping a user make a bomb, and find that popular defenses such as safety training, adversarial training, and input/output classifiers are inadequate. In pursuit of a better defense, we develop our own classifier defense tailored to our bomb setting, which outperforms existing defenses on some axes but is still ultimately broken. We conclude that jailbreak defense is challenging even in a narrow domain.