Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
Abstract
Supervised fine-tuning (SFT) on chain-of-thought data induces brittleness in language models: it improves reasoning capabilities while severely degrading general performance. We provide the first mechanistic explanation for this trade-off through three complementary techniques: crosscoders for mapping feature transformations, Fisher Information-based identification of causal features, and gradient blocking for intervention experiments. Our analysis reveals that SFT operates through two distinct mechanisms: repurposing shared features for reasoning tasks and suppressing base-only features. Applying Fisher Information to Sparse Autoencoder (SAE) features identifies the specific features responsible for reasoning, a finding validated through feature steering that yields 3.46% performance gains on base models. Crosscoder analysis shows that SFT repurposes reasoning capabilities already present in the base model rather than creating new ones. Gradient blocking experiments demonstrate that the two mechanisms are separable: blocking shared features eliminates reasoning entirely, while blocking base-only features preserves it, indicating that suppressing base features is unnecessary for reasoning. This mechanistic understanding lays the foundation for surgical training methods that preserve general capabilities while enhancing reasoning.
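The abstract does not specify how the Fisher Information scores over SAE features are computed; a common construction is the empirical diagonal Fisher, which scores each feature by the expected squared gradient of the loss with respect to that feature's activation. The sketch below illustrates this construction under stated assumptions: the SparseAutoencoder class, the diagonal_fisher helper, all tensor shapes, and the placeholder reconstruction loss are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch: empirical diagonal Fisher over SAE feature activations.
# All names, shapes, and the loss function here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, N_FEATURES, BATCH, SEQ = 64, 512, 8, 16

class SparseAutoencoder(nn.Module):
    """Toy SAE: encodes residual-stream activations into sparse features."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats

def diagonal_fisher(sae, get_activations, loss_fn, n_batches=4):
    """Estimate per-feature importance as E[(dL/df_i)^2], the diagonal Fisher."""
    fisher = torch.zeros(N_FEATURES)
    for _ in range(n_batches):
        x = get_activations()             # (batch, seq, d_model) residual activations
        recon, feats = sae(x)
        feats.retain_grad()               # need gradients w.r.t. feature activations
        loss = loss_fn(recon, x)          # placeholder; the paper would use a task loss
        loss.backward()
        fisher += (feats.grad ** 2).sum(dim=(0, 1)).detach()
        sae.zero_grad()
    return fisher / n_batches

# Stand-ins for real model activations and the downstream loss.
sae = SparseAutoencoder(D_MODEL, N_FEATURES)
fake_activations = lambda: torch.randn(BATCH, SEQ, D_MODEL)

scores = diagonal_fisher(sae, fake_activations, nn.MSELoss())
top_features = scores.topk(10).indices    # candidate reasoning-relevant features
print(top_features)
```

In the paper's setting, the placeholder loss would presumably be the model's chain-of-thought task loss, and the top-scoring features would become the candidates for the feature-steering and gradient-blocking interventions described above.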