11 Results
Poster
|
Thu 11:00 |
Secret Collusion among AI Agents: Multi-Agent Deception via Steganography Sumeet Motwani · Mikhail Baranchuk · Martin Strohmeier · Vijay Bolina · Philip Torr · Lewis Hammond · Christian Schroeder de Witt |
|
Workshop
|
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack Leo McKee-Reid · Joe Needham · Maria Martinez · Christoph Sträter · Mikita Balesni |
||
Workshop
|
Algorithmic Oversight for Deceptive Reasoning Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak |
||
Workshop
|
Algorithmic Oversight for Deceptive Reasoning Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak |
||
Workshop
|
Algorithmic Oversight for Deceptive Reasoning Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak |
||
Workshop
|
Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback Marcus Williams · Micah Carroll · Constantin Weisser · Adhyyan Narang · Brendan Murphy · Anca Dragan |
||
Workshop
|
Modelling the oversight of deceptive interpretability agents Simon Lermen · Mateusz Dziemian |
||
Workshop
|
MISR: Measuring Instrumental Self-Reasoning in Frontier Models Kai Fronsdal · David Lindner |
||
Workshop
|
Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback Marcus Williams · Micah Carroll · Constantin Weisser · Brendan Murphy · Adhyyan Narang · Anca Dragan |
||
Workshop
|
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt Yusu Qian · Haotian Zhang · Yinfei Yang · Zhe Gan |
||
Workshop
|
Interpretability of LLM Deception: Universal Motif Wannan Yang · Chen Sun · Gyorgy Buzsaki |