24 Results
Poster | Wed 11:00 | SafeWorld: Geo-Diverse Safety Alignment | Da Yin · Haoyi Qiu · Kung-Hsiang Huang · Kai-Wei Chang · Nanyun Peng

Poster | Thu 16:30 | Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space | Leo Schwinn · David Dobre · Sophie Xhonneux · Gauthier Gidel · Stephan Günnemann

Poster | Fri 16:30 | One-Shot Safety Alignment for Large Language Models via Optimal Dualization | Xinmeng Huang · Shuo Li · Edgar Dobriban · Osbert Bastani · Hamed Hassani · Dongsheng Ding

Workshop | Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents | Samuel Brown · Basil Labib · Codruta Lugoj · Sai Sasank Y

Workshop | SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs | Ruben Härle · Felix Friedrich · Manuel Brack · Björn Deiseroth · Patrick Schramowski · Kristian Kersting

Workshop | Language Models Resist Alignment | Jiaming Ji · Kaile Wang · Tianyi (Alex) Qiu · Boyuan Chen · Changye Li · Hantao Lou · Jiayi Zhou · Juntao Dai · Yaodong Yang

Poster | Fri 16:30 | Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack | Tiansheng Huang · Sihao Hu · Fatih Ilhan · Selim Tekin · Ling Liu

Workshop | SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Anurakt Kumar · Divyanshu Kumar · Jatan Loya · Nitin Aravind Birur · Tanay Baswa · Sahil Agarwal · Prashanth Harshangi

Poster | Fri 16:30 | Improving Alignment and Robustness with Circuit Breakers | Andy Zou · Long Phan · Justin Wang · Derek Duenas · Maxwell Lin · Maksym Andriushchenko · J. Zico Kolter · Matt Fredrikson · Dan Hendrycks

Poster | Thu 11:00 | BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Jiongxiao Wang · Jiazhao LI · Yiquan Li · Xiangyu Qi · Junjie Hu · Sharon Li · Patrick McDaniel · Muhao Chen · Bo Li · Chaowei Xiao

Workshop | HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation | Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine

Workshop | Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries | Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine