Poster 2 The Horcrux: Mechanistically Interpretable Task Decomposition for Reward Hacking Detection
2025 Poster Session
in
Workshop: NeurIPS 2025 Workshop on Embodied and Safe-Assured Robotic Systems