NeurIPS 2024

Workshop

Sun 11:20

The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains
Ezra Edelman · Nikolaos Tsilivis · Surbhi Goel · Benjamin Edelman · Eran Malach

Workshop

MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal · David Lindner

Workshop

An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando

Workshop

Evaluating Synthetic Activations composed of SAE Latents in GPT-2
Nora Petrova · Giorgi Giglemiani · Chatrik Mangat · Jett Janiak · Stefan Heimersheim

Workshop

Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo · Yongjin Yang · Hwaran Lee

Workshop

Towards Safe Multilingual Frontier AI
Arturs Kanepajs · Vladimir Ivanov · Richard Moulange

Workshop

A Cautionary Tale on the Evaluation of Differentially Private In-Context Learning
Anjun Hu · Jiyang Guan · Philip Torr · Francesco Pinto

Workshop

Plentiful Jailbreaks with String Compositions
Brian Huang

Workshop

Measuring AI Agent Autonomy: Towards a Scalable Approach With Code Inspection
Merlin Stein · Peter Cihon · Gagan Bansal · Sam Manning

Workshop

How Does LLM Compression Affect Weight Exfiltration Attacks?
Davis Brown · Mantas Mazeika

Workshop

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu · Shujian Zhang · Kaiqiang Song · Silei Xu · Sanqiang Zhao · Ravi Agrawal · Sathish Indurthi · Chong Xiang · Prateek Mittal · Wenxuan Zhou

Poster

SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types
Yutao Mou · Shikun Zhang · Wei Ye
