The Importance and Fragility of CoT Monitorability
Bowen Baker
Abstract
AI systems that verbalize their “thinking” in human language offer a new, yet possibly fragile, opportunity for AI safety. We’ve found that monitoring chains of thought (CoT) can be highly effective for catching misbehavior during frontier reasoning model training and deployment. However, chain-of-thought monitorability may prove fragile in the face of increased scaling and algorithmic advancements. In this talk, we’ll discuss existing work OpenAI has done in this area and where we are looking to go moving forward.
Successful Page Load