Search All 2024 Events
 

11 Results

Poster
Thu 11:00 Secret Collusion among AI Agents: Multi-Agent Deception via Steganography
Sumeet Motwani · Mikhail Baranchuk · Martin Strohmeier · Vijay Bolina · Philip Torr · Lewis Hammond · Christian Schroeder de Witt
Workshop
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
Leo McKee-Reid · Joe Needham · Maria Martinez · Christoph Sträter · Mikita Balesni
Workshop
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
Workshop
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
Workshop
Algorithmic Oversight for Deceptive Reasoning
Ege Onur Taga · Mingchen Li · Yongqi Chen · Samet Oymak
Workshop
Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback
Marcus Williams · Micah Carroll · Constantin Weisser · Adhyyan Narang · Brendan Murphy · Anca Dragan
Workshop
Modelling the oversight of deceptive interpretability agents
Simon Lermen · Mateusz Dziemian
Workshop
MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal · David Lindner
Workshop
Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback
Marcus Williams · Micah Carroll · Constantin Weisser · Brendan Murphy · Adhyyan Narang · Anca Dragan
Workshop
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompt
Yusu Qian · Haotian Zhang · Yinfei Yang · Zhe Gan
Workshop
Interpretability of LLM Deception: Universal Motif
Wannan Yang · Chen Sun · Gyorgy Buzsaki