Invited Talk 4 - Peter Henderson
Peter Henderson
Abstract
Title: Building Better Rules and Finding Shallow Alignment in AI Agents
Abstract: We increasingly rely on natural language specifications, or human feedback, to align agents. But getting long-horizon behaviors right under ambiguous specifications can be challenging. In this talk, we take an interdisciplinary perspective on how to lint rule specifications for potential ambiguities. We also examine how fine-tuning can induce (or undo) "shallow" alignment in AI agents.