126 Results
Workshop | Sun 11:20 | The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains | Ezra Edelman · Nikolaos Tsilivis · Surbhi Goel · Benjamin Edelman · Eran Malach
Workshop | | MISR: Measuring Instrumental Self-Reasoning in Frontier Models | Kai Fronsdal · David Lindner
Workshop | | An Adversarial Perspective on Machine Unlearning for AI Safety | Jakub Łucki · Boyi Wei · Yangsibo Huang · Peter Henderson · Florian Tramer · Javier Rando
Workshop | | Evaluating Synthetic Activations composed of SAE Latents in GPT-2 | Nora Petrova · Giorgi Giglemiani · Chatrik Mangat · Jett Janiak · Stefan Heimersheim
Workshop | | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Haneul Yoo · Yongjin Yang · Hwaran Lee
Workshop | | Towards Safe Multilingual Frontier AI | Arturs Kanepajs · Vladimir Ivanov · Richard Moulange
Workshop | | A Cautionary Tale on the Evaluation of Differentially Private In-Context Learning | Anjun Hu · Jiyang Guan · Philip Torr · Francesco Pinto
Workshop | | Plentiful Jailbreaks with String Compositions | Brian Huang
Workshop | | Measuring AI Agent Autonomy: Towards a Scalable Approach With Code Inspection | Merlin Stein · Peter Cihon · Gagan Bansal · Sam Manning
Workshop | | How Does LLM Compression Affect Weight Exfiltration Attacks? | Davis Brown · Mantas Mazeika
Workshop | | Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy | Tong Wu · Shujian Zhang · Kaiqiang Song · Silei Xu · Sanqiang Zhao · Ravi Agrawal · Sathish Indurthi · Chong Xiang · Prateek Mittal · Wenxuan Zhou
Poster | | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Yutao Mou · Shikun Zhang · Wei Ye