Poster
Query-Based Adversarial Prompt Generation
Jonathan Hayase · Ema Borevković · Nicholas Carlini · Florian Tramer · Milad Nasr
East Exhibit Hall A-C #4501
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights) or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail to elicit, and we can evade the OpenAI and Llama Guard safety classifiers with nearly 100% probability.
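The abstract does not describe the attack's implementation, so the sketch below is only a rough illustration of the query-based setting it refers to: a black-box random-search loop that edits an adversarial suffix and keeps edits that increase a score obtained by querying the remote model. The function `target_logprob` is a hypothetical placeholder (not from the paper) standing in for an API query that returns the log-probability the remote model assigns to a target harmful string; here it is stubbed with a toy proxy so the sketch runs end to end.

```python
# Hedged sketch of a generic query-based adversarial suffix search.
# NOT the authors' method: the scoring function is a placeholder for
# API queries to a remote language model.
import random
import string


def target_logprob(prompt: str, suffix: str, target: str) -> float:
    """Placeholder score. In a real query-based attack this would call the
    remote model's API and return log p(target | prompt + suffix).
    Toy stand-in (illustration only): reward character overlap with the target."""
    return sum(c in target for c in suffix) + random.random() * 0.1


def random_search_attack(prompt: str, target: str, suffix_len: int = 20,
                         steps: int = 200, candidates_per_step: int = 8) -> str:
    """Greedy random search: propose single-character edits to the suffix and
    keep the candidate that most increases the queried target log-probability."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = "".join(random.choice(alphabet) for _ in range(suffix_len))
    best_score = target_logprob(prompt, suffix, target)
    for _ in range(steps):
        best_candidate, best_candidate_score = None, best_score
        for _ in range(candidates_per_step):
            pos = random.randrange(suffix_len)
            cand = suffix[:pos] + random.choice(alphabet) + suffix[pos + 1:]
            score = target_logprob(prompt, cand, target)  # one API query per candidate
            if score > best_candidate_score:
                best_candidate, best_candidate_score = cand, score
        if best_candidate is not None:
            suffix, best_score = best_candidate, best_candidate_score
    return suffix


if __name__ == "__main__":
    adv_suffix = random_search_attack("Example prompt ...", target="Sure, here is")
    print("Found suffix:", adv_suffix)
```

The loop structure is the point here: each candidate edit costs one query to the remote model, and the attack succeeds or fails purely based on the probabilities the API exposes, with no access to gradients or weights.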