20 Results
Workshop
|
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning Seanie Lee · Minsu Kim · Lynn Cherif · David Dobre · Juho Lee · Sung Ju Hwang · Kenji Kawaguchi · Gauthier Gidel · Yoshua Bengio · Nikolay Malkin · Moksh Jain |
||
Poster
|
Fri 16:30 |
Improving Alignment and Robustness with Circuit Breakers Andy Zou · Long Phan · Justin Wang · Derek Duenas · Maxwell Lin · Maksym Andriushchenko · J. Zico Kolter · Matt Fredrikson · Dan Hendrycks |
|
Poster
|
Thu 16:30 |
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space Leo Schwinn · David Dobre · Sophie Xhonneux · Gauthier Gidel · Stephan Günnemann |
|
Workshop
|
Plentiful Jailbreaks with String Compositions Brian Huang |
||
Workshop
|
Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage Rafi Rashid · Jing Liu · Toshiaki Koike-Akino · Shagufta Mehnaz · Ye Wang |
||
Workshop
|
Does Refusal Training in LLMs Generalize to the Past Tense? Maksym Andriushchenko · Nicolas Flammarion |
||
Workshop
|
Between the Bars: Gradient-based Jailbreaks are Bugs that induce Features Kaivalya Hariharan · Uzay Girit |
||
Affinity Event
|
Tue 14:00 |
Invited Talk 2 by Lama Ahmad (Technical Program Manager, Trustworthy AI at OpenAI): Human and AI Evaluations for Safety and Robustness Testing Lama Ahmad |
|
Workshop
|
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Aidan Ewart · Abhay Sheshadri · Phillip Guo · Aengus Lynch · Cindy Wu · Vivek Hebbar · Henry Sleight · Asa Cooper Stickland · Ethan Perez · Dylan Hadfield-Menell · Stephen Casper |
||
Workshop
|
Sun 11:05 |
Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue |
|
Workshop
|
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue |