13 Results
| Type | Time | Title | Authors |
|---|---|---|---|
| Workshop | | LLM Improvement for Jailbreak Defense: Analysis Through the Lens of Over-Refusal | Swetasudha Panda · Naveen Jafer Nizar · Michael Wick |
| Workshop | | Applying Refusal-Vector Ablation to Llama 3.1 70B Agents | Simon Lermen · Mateusz Dziemian · Govind Pimpale |
| Poster | Wed 16:30 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Xiaomeng Hu · Pin-Yu Chen · Tsung-Yi Ho |
| Workshop | Sun 14:50 | Does Refusal Training in LLMs Generalize to the Past Tense? | |
| Workshop | | Evaluating Refusal | Shira Abramovich · Anna J. Ma |
| Workshop | Sat 12:00 | Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs | Alexander von Recum · Christoph Schnabl · Gabor Hollbeck · Marvin von Hagen · Silas Alberti · Philip Blinde |
| Workshop | | Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko · Nicolas Flammarion |
| Workshop | | Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models | Neel Jain · Aditya Shrivastava · Chenyang Zhu · Daben Liu · Alfy Samuel · Ashwinee Panda · Anoop Kumar · Micah Goldblum · Tom Goldstein |
| Workshop | | Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko · Nicolas Flammarion |
| Workshop | | Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko · Nicolas Flammarion |
| Poster | Thu 16:30 | WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Seungju Han · Kavel Rao · Allyson Ettinger · Liwei Jiang · Bill Yuchen Lin · Nathan Lambert · Yejin Choi · Nouha Dziri |
| Workshop | | Stronger Universal and Transfer Attacks by Suppressing Refusals | David Huang · Avidan Shah · Alexandre Araujo · David Wagner · Chawin Sitawarin |