Locks Tested Without Burglars: Using Coding Assistants to Break Prompt Injection Defenses
Abstract
Prompt injection is a critical security challenge for large language models (LLMs), yet proposed defenses are typically evaluated on toy benchmarks that fail to reflect real adversaries. We show that AI coding assistants, such as Claude Code, can act as automated red-teamers: they parse defense code, uncover hidden prompts and assumptions, and generate adaptive natural-language attacks. Evaluating three recent defenses -- DataSentinel, Melon, and DRIFT -- across standard and realistic benchmarks, we find that assistants extract defense logic and craft attacks that raise attack success rates by as much as 50--60\%. These results suggest that coding assistants are not just productivity tools but practical adversarial collaborators, and that defenses should be tested against them before claims of robustness are made.