Search All 2024 Events
 

13 Results

Page 1 of 2
Workshop
LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal
Swetasudha Panda · Naveen Jafer Nizar · Michael Wick
Workshop
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen · Mateusz Dziemian · Govind Pimpale
Poster
Wed 16:30 Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Xiaomeng Hu · Pin-Yu Chen · Tsung-Yi Ho
Workshop
Sun 14:50 Does Refusal Training in LLMs Generalize to the Past Tense?
Workshop
Evaluating Refusal
Shira Abramovich · Anna J. Ma
Workshop
Sat 12:00 Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs
Alexander von Recum · Christoph Schnabl · Gabor Hollbeck · Marvin von Hagen · Silas Alberti · Philip Blinde
Workshop
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko · Nicolas Flammarion
Workshop
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
Neel Jain · Aditya Shrivastava · Chenyang Zhu · Daben Liu · Alfy Samuel · Ashwinee Panda · Anoop Kumar · Micah Goldblum · Tom Goldstein
Workshop
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko · Nicolas Flammarion
Workshop
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko · Nicolas Flammarion
Poster
Thu 16:30 WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han · Kavel Rao · Allyson Ettinger · Liwei Jiang · Bill Yuchen Lin · Nathan Lambert · Yejin Choi · Nouha Dziri
Workshop
Stronger Universal and Transfer Attacks by Suppressing Refusals
David Huang · Avidan Shah · Alexandre Araujo · David Wagner · Chawin Sitawarin