

Search All 2024 Events
 

Workshop
Controllable Safety Alignment: Adapting LLMs to Diverse Safety Requirements without Re-Training
Jingyu Zhang · Ahmed Elgohary Ghoneim · Ahmed Magooda · Daniel Khashabi · Ben Van Durme
Workshop
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando
Workshop
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko · Nicolas Flammarion
Poster
Thu 11:00 Transcendence: Generative Models Can Outperform The Experts That Train Them
Edwin Zhang · Vincent Zhu · Naomi Saphra · Anat Kleiman · Benjamin Edelman · Milind Tambe · Sham Kakade · Eran Malach
Workshop
Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
Sravanti Addepalli · Yerram Varun · Arun Suggala · Karthikeyan Shanmugam · Prateek Jain
Workshop
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
Aidan Ewart · Abhay Sheshadri · Phillip Guo · Aengus Lynch · Cindy Wu · Vivek Hebbar · Henry Sleight · Asa Cooper Stickland · Ethan Perez · Dylan Hadfield-Menell · Stephen Casper