NeurIPS Poster Many-shot Jailbreaking

Poster

Many-shot Jailbreaking

Cem Anil · Esin DURMUS · Nina Panickssery · Mrinank Sharma · Joe Benton · Sandipan Kundu · Joshua Batson · Meg Tong · Jesse Mu · Daniel Ford · Francesco Mosconi · Rajashree Agrawal · Rylan Schaeffer · Naomi Bashkansky · Samuel Svenningsen · Mike Lambert · Ansh Radhakrishnan · Carson Denison · Evan Hubinger · Yuntao Bai · Trenton Bricken · Timothy Maxwell · Nicholas Schiefer · James Sully · Alex Tamkin · Tamera Lanham · Karina Nguyen · Tomek Korbak · Jared Kaplan · Deep Ganguli · Samuel Bowman · Ethan Perez · Roger Grosse · David Duvenaud

East Exhibit Hall A-C #2109

[ Abstract ]

[ Paper] [ Poster] [ OpenReview]

Fri 13 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This attack is newly feasible with the larger context windows recently deployed by language model providers like Google DeepMind, OpenAI and Anthropic. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.

Chat is not available.