Probing Reasoning Flaws and Safety Hierarchies with Chain-of-Thought Difference Amplification
Abstract
Detecting rare but critical failures in Large Language Models (LLMs) is a pressing challenge for safe deployment, as vulnerabilities introduced during alignment are often missed by standard benchmarks. We introduce Chain-of-Thought Difference (CoT Diff) Amplification, a logit-steering technique that systematically probes model reasoning. The method steers inference by amplifying the difference between the next-token logits obtained under two contrastive reasoning paths, enabling targeted pressure-testing of a model’s behavioral tendencies. We apply the technique to a base model and a domain-adapted variant across a suite of safety and factual-coherence benchmarks. Our primary finding is a clear hierarchy in the model’s safety guardrails: while the model refuses to provide unethical advice or pseudoscience at baseline, it readily generates detailed misinformation when prompted with a specific persona, revealing a critical vulnerability even without amplification.
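
To make the steering step concrete, the sketch below implements one plausible reading of the amplification rule described above: at each decoding step, the next-token logits obtained under a target reasoning path are pushed further away from those obtained under a contrastive path. This is a minimal sketch, not the paper's exact procedure; the model name, the two reasoning-path prefixes, the greedy decoding loop, and the coefficient alpha are illustrative assumptions.

```python
# Sketch of CoT Diff Amplification under the assumptions stated above:
# steer generation by amplifying the gap between next-token logits
# conditioned on two contrastive chain-of-thought prefixes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def cot_diff_generate(prompt, cot_target, cot_contrast, alpha=1.5, max_new_tokens=64):
    """Greedy decoding with an amplified difference between two CoT conditions.

    cot_target / cot_contrast: contrastive reasoning-path prefixes (illustrative).
    alpha: amplification strength; the exact steering formula is an assumption.
    """
    ids_a = tok(cot_target + prompt, return_tensors="pt").input_ids
    ids_b = tok(cot_contrast + prompt, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits_a = model(ids_a).logits[:, -1, :]  # target condition
            logits_b = model(ids_b).logits[:, -1, :]  # contrastive condition
        # Push the target distribution away from the contrastive one.
        steered = logits_a + alpha * (logits_a - logits_b)
        next_id = steered.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # Append the chosen token to both conditions so the continuations stay aligned.
        ids_a = torch.cat([ids_a, next_id], dim=-1)
        ids_b = torch.cat([ids_b, next_id], dim=-1)
    return tok.decode(generated)
```

In this reading, setting alpha to zero recovers ordinary greedy decoding under the target reasoning path, while larger values apply increasing pressure along the direction that separates the two reasoning conditions.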