16 Results
Workshop | | Introspection, Updatability, and Uncertainty Quantification with Transformers: Concrete Methods for AI Safety Allen Schmaltz · Danielle Rasooly
Workshop | Fri 12:30 | Sam Bowman: What's the deal with AI safety? Samuel Bowman
Poster | Tue 14:00 | Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits Ruibo Liu · Chenyan Jia · Ge Zhang · Ziyu Zhuang · Tony Liu · Soroush Vosoughi
Workshop | Fri 7:00 | Workshop on Machine Learning Safety Dan Hendrycks · Victoria Krakovna · Dawn Song · Jacob Steinhardt · Nicholas Carlini
Workshop | | Red-Teaming the Stable Diffusion Safety Filter Javier Rando · Daniel Paleka · David Lindner · Lennart Heim · Florian Tramer
Poster | Wed 9:00 | Capturing Failures of Large Language Models via Human Cognitive Biases Erik Jones · Jacob Steinhardt
Poster | Wed 9:00 | On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach Dennis Wei · Rahul Nair · Amit Dhurandhar · Kush Varshney · Elizabeth Daly · Moninder Singh
Competition | Thu 13:00 | The Trojan Detection Challenge Mantas Mazeika · Dan Hendrycks · Huichen Li · Xiaojun Xu · Andy Zou · Sidney Hough · Arezoo Rajabi · Dawn Song · Radha Poovendran · Bo Li · David Forsyth
Poster | Thu 9:00 | Defining and Characterizing Reward Gaming Joar Skalse · Nikolaus Howe · Dmitrii Krasheninnikov · David Krueger
Workshop | Fri 7:25 | Beyond Safety: Toward a Value-Sensitive Approach to the Design of AI Systems Alexander J. Fiannaca · Cynthia L. Bennett · Shaun Kane · Meredith Ringel Morris
Workshop | | Indexing AI Risks with Incidents, Issues, and Variants Sean McGregor · Kevin Paeth · Khoa Lam
Poster | | MExMI: Pool-based Active Model Extraction Crossover Membership Inference Yaxin Xiao · Qingqing Ye · Haibo Hu · Huadi Zheng · Chengfang Fang · Jie Shi