Search: All 2024 Events
24 Results (Page 2 of 2)
Poster
Wed 11:00 SafeWorld: Geo-Diverse Safety Alignment
Da Yin · Haoyi Qiu · Kung-Hsiang Huang · Kai-Wei Chang · Nanyun Peng
Poster
Thu 16:30 Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn · David Dobre · Sophie Xhonneux · Gauthier Gidel · Stephan Günnemann
Poster
Fri 16:30 One-Shot Safety Alignment for Large Language Models via Optimal Dualization
Xinmeng Huang · Shuo Li · Edgar Dobriban · Osbert Bastani · Hamed Hassani · Dongsheng Ding
Workshop
Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents
Samuel Brown · Basil Labib · Codruta Lugoj · Sai Sasank Y
Workshop
SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs
Ruben Härle · Felix Friedrich · Manuel Brack · Björn Deiseroth · Patrick Schramowski · Kristian Kersting
Workshop
Language Models Resist Alignment
Jiaming Ji · Kaile Wang · Tianyi (Alex) Qiu · Boyuan Chen · Changye Li · Hantao Lou · Jiayi Zhou · Juntao Dai · Yaodong Yang
Poster
Fri 16:30 Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
Tiansheng Huang · Sihao Hu · Fatih Ilhan · Selim Tekin · Ling Liu
Workshop
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
Anurakt Kumar · Divyanshu Kumar · Jatan Loya · Nitin Aravind Birur · Tanay Baswa · Sahil Agarwal · Prashanth Harshangi
Poster
Fri 16:30 Improving Alignment and Robustness with Circuit Breakers
Andy Zou · Long Phan · Justin Wang · Derek Duenas · Maxwell Lin · Maksym Andriushchenko · J. Zico Kolter · Matt Fredrikson · Dan Hendrycks
Poster
Thu 11:00 BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Jiongxiao Wang · Jiazhao LI · Yiquan Li · Xiangyu Qi · Junjie Hu · Sharon Li · Patrick McDaniel · Muhao Chen · Bo Li · Chaowei Xiao
Workshop
HarmAnalyst: Interpretable, transparent, and steerable LLM safety moderation
Jing-Jing Li · Valentina Pyatkin · Max Kleiman-Weiner · Liwei Jiang · Nouha Dziri · Anne Collins · Jana Schaich Borg · Maarten Sap · Yejin Choi · Sydney Levine
Workshop
Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries
Julius Broomfield · George Ingebretsen · Reihaneh Iranmanesh · Sara Pieri · Ethan Kosak-Hine · Tom Gibbs · Reihaneh Rabbany · Kellin Pelrine