Poster
Query-Based Adversarial Prompt Generation
Jonathan Hayase · Ema Borevković · Nicholas Carlini · Florian Tramer · Milad Nasr
East Exhibit Hall A-C #4501
Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights) or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail to elicit, and we can evade the OpenAI and Llama Guard safety classifiers with nearly 100% probability.
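The abstract does not describe the attack's implementation, so the sketch below is only a rough illustration of the query-based setting it refers to: a black-box random-search loop that edits an adversarial suffix and keeps edits that increase a score obtained by querying the remote model. The function `target_logprob` is a hypothetical placeholder (not from the paper) standing in for an API query that returns the log-probability the remote model assigns to a target harmful string; here it is stubbed with a toy proxy so the sketch runs end to end.

```python
# Hedged sketch of a generic query-based adversarial suffix search.
# NOT the authors' method: the scoring function is a placeholder for
# API queries to a remote language model.
import random
import string


def target_logprob(prompt: str, suffix: str, target: str) -> float:
    """Placeholder score. In a real query-based attack this would call the
    remote model's API and return log p(target | prompt + suffix).
    Toy stand-in (illustration only): reward character overlap with the target."""
    return sum(c in target for c in suffix) + random.random() * 0.1


def random_search_attack(prompt: str, target: str, suffix_len: int = 20,
                         steps: int = 200, candidates_per_step: int = 8) -> str:
    """Greedy random search: propose single-character edits to the suffix and
    keep the candidate that most increases the queried target log-probability."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = "".join(random.choice(alphabet) for _ in range(suffix_len))
    best_score = target_logprob(prompt, suffix, target)
    for _ in range(steps):
        best_candidate, best_candidate_score = None, best_score
        for _ in range(candidates_per_step):
            pos = random.randrange(suffix_len)
            cand = suffix[:pos] + random.choice(alphabet) + suffix[pos + 1:]
            score = target_logprob(prompt, cand, target)  # one API query per candidate
            if score > best_candidate_score:
                best_candidate, best_candidate_score = cand, score
        if best_candidate is not None:
            suffix, best_score = best_candidate, best_candidate_score
    return suffix


if __name__ == "__main__":
    adv_suffix = random_search_attack("Example prompt ...", target="Sure, here is")
    print("Found suffix:", adv_suffix)
```

The loop structure is the point here: each candidate edit costs one query to the remote model, and the attack succeeds or fails purely based on the probabilities the API exposes, with no access to gradients or weights.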