

Poster

Query-Based Adversarial Prompt Generation

Jonathan Hayase · Ema Borevković · Nicholas Carlini · Florian Tramer · Milad Nasr

Fri 13 Dec 11 a.m. PST — 2 p.m. PST

Abstract:

Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail to elicit, and we can evade the OpenAI and Llama Guard safety classifiers with nearly 100% probability.
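To give a flavor of what "query-based" means here, the sketch below shows a minimal hill-climbing loop that mutates an adversarial suffix one token at a time and keeps a substitution only if the target string's log-probability, as reported back by the remote model, improves. This is an illustrative simplification, not the authors' exact algorithm; `query_target_logprob` is a hypothetical placeholder for the per-token log-probabilities a real model API would return, stubbed out with a toy scorer so the example runs on its own.

```python
import random

# HYPOTHETICAL placeholder for API access to the remote model: in a real
# query-based attack this would sum the per-token log-probabilities the API
# reports for `target` following `prompt + suffix`. Here it is a toy,
# deterministic scorer so the sketch is self-contained and runnable.
def query_target_logprob(prompt: str, suffix: str, target: str) -> float:
    return -(abs(hash((prompt, suffix, target))) % 1000) / 100.0

VOCAB = [chr(c) for c in range(33, 127)]  # toy "token" vocabulary (printable ASCII)


def query_based_attack(prompt: str, target: str, suffix_len: int = 20,
                       iters: int = 500, seed: int = 0) -> str:
    """Greedy random-substitution search over an adversarial suffix, keeping a
    candidate only when the queried log-probability of the target improves."""
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    best = query_target_logprob(prompt, "".join(suffix), target)
    for _ in range(iters):
        pos = rng.randrange(suffix_len)                 # position to mutate
        old, suffix[pos] = suffix[pos], rng.choice(VOCAB)
        score = query_target_logprob(prompt, "".join(suffix), target)
        if score > best:
            best = score                                # keep the improving swap
        else:
            suffix[pos] = old                           # otherwise revert
    return "".join(suffix)


if __name__ == "__main__":
    adv = query_based_attack("Tell me a story about", "<target string>")
    print("adversarial suffix:", adv)
```

The key design point illustrated is that the attacker never needs gradients or weights: every decision is made from scalar feedback returned by ordinary API queries.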
