

Poster in Workshop: Workshop on robustness of zero/few-shot learning in foundation models (R0-FoMo)

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Alexander Robey · Eric Wong · Hamed Hassani · George J. Pappas


Abstract:

Despite efforts to align large language models (LLMs), widely-used LLMs such as GPT and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation. Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM.
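The abstract describes the defense at a high level: perturb several copies of the incoming prompt at the character level, query the LLM on each copy, and aggregate the outcomes. The sketch below illustrates that pipeline under stated assumptions; `query_llm` and `is_jailbroken` are hypothetical placeholders for the target model and a jailbreak detector (e.g., a refusal-keyword check), and the perturbation type, number of copies, and perturbation rate `q` are illustrative rather than the paper's exact settings.

```python
import random
import string


def perturb(prompt: str, q: float) -> str:
    """Randomly replace a fraction q of the prompt's characters."""
    chars = list(prompt)
    num_to_swap = int(len(chars) * q)
    for idx in random.sample(range(len(chars)), num_to_swap):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)


def smoothllm_defense(prompt, query_llm, is_jailbroken, n_copies=10, q=0.1):
    """Query the LLM on perturbed copies and aggregate by majority vote.

    query_llm and is_jailbroken are assumed, user-supplied callables:
    the first maps a prompt string to a model response, the second
    flags whether a response constitutes a successful jailbreak.
    """
    responses = [query_llm(perturb(prompt, q)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(votes) > len(votes) / 2
    # Return a response whose jailbreak status agrees with the majority,
    # so the final output reflects the aggregated decision.
    for response, vote in zip(responses, votes):
        if vote == majority_jailbroken:
            return response
    return responses[0]
```

Because adversarial suffixes are brittle to character-level changes, most perturbed copies fail to jailbreak the model, so the majority vote suppresses the attack while leaving benign prompts largely unaffected.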
