TDC 2023 (LLM Edition): The Trojan Detection Challenge

Mantas Mazeika · Andy Zou · Norman Mu · Long Phan · Zifan Wang · Chunru Yu · Adam Khoja · Fengqing Jiang · Aidan O'Gara · Zhen Xiang · Arezoo Rajabi · Dan Hendrycks · Radha Poovendran · Bo Li · David Forsyth

Room 357
Sat 16 Dec 11:30 a.m. PST — 2:30 p.m. PST


The Trojan Detection Challenge (LLM Edition) aims to advance the understanding and development of methods for detecting hidden functionality in large language models (LLMs). The competition features two main tracks: the Trojan Detection Track and the Red Teaming Track. In the Trojan Detection Track, participants are given a large language model containing thousands of trojans and tasked with discovering the triggers for these trojans. In the Red Teaming Track, participants are challenged to elicit specific undesirable behaviors from a large language model fine-tuned to avoid those behaviors. TDC 2023 will include Base Model and Large Model subtracks to enable broader participation, and established trojan detection and red teaming baselines will be provided as a starting point. By uniting trojan detection and red teaming, TDC 2023 aims to foster collaboration between these communities to promote research on hidden functionality in LLMs and enhance the robustness and security of AI systems.
