NeurIPS 2024

Poster

Thu 11:00

Intruding with Words: Towards Understanding Graph Injection Attacks at the Text Level
Runlin Lei · Yuwei Hu · Yuchen Ren · Zhewei Wei

Workshop

Sun 14:40

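Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo · Giandomenico Cornacchia · Kieran Fraser · Muhammad Zaid Hameed · Ambrish Rawat · Beat Buesser · Mark Purcell · Pin-Yu Chen · Prasanna Sattigeri · Kush Varshney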

Workshop

LLM-PIRATE: A benchmark for indirect prompt injection attacks in Large Language Models
Anil Ramakrishna · Jimit Majmudar · Rahul Gupta · Devamanyu Hazarika

Poster

Wed 16:30

Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor
Shaokui Wei · Hongyuan Zha · Baoyuan Wu

Poster

Thu 16:30

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Edoardo Debenedetti · Jie Zhang · Mislav Balunovic · Luca Beurer-Kellner · Marc Fischer · Florian Tramer

Poster

Wed 16:30

Are Your Models Still Fair? Fairness Attacks on Graph Neural Networks via Node Injections
Zihan Luo · Hong Huang · Yongkang Zhou · Jiping Zhang · Nuo Chen · Hai Jin

Poster

Thu 16:30

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou · Bo Li · Haohan Wang

Workshop

Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI
Ambrish Rawat · Stefan Schoepf · Giulio Zizzo · Giandomenico Cornacchia · Muhammad Zaid Hameed · Kieran Fraser · Erik Miehling · Beat Buesser · Elizabeth Daly · Mark Purcell · Prasanna Sattigeri · Pin-Yu Chen · Kush Varshney

Poster

Fri 11:00

Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
Wenjie Fu · Huandong Wang · Chen Gao · Guanghua Liu · Yong Li · Tao Jiang

Poster

Thu 16:30

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn · David Dobre · Sophie Xhonneux · Gauthier Gidel · Stephan Günnemann

Workshop

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
Nathalie Kirch · Severin Field · Stephen Casper

Workshop

Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo · Giandomenico Cornacchia · Kieran Fraser · Muhammad Zaid Hameed · Ambrish Rawat · Beat Buesser · Mark Purcell · Pin-Yu Chen · Prasanna Sattigeri · Kush Varshney