Poster in Workshop: Reliable ML from Unreliable Data
Sat, Dec 6, 2025 • 11:00 AM – 12:00 PM PST

A Few Bad Neurons: Isolating and Surgically Correcting Sycophancy

Claire O'Brien · Jessica Seto · Dipankar Roy · Aditya Dwivedi · Sunishchal Dev · Kevin Zhu · Sean O'Brien · Ryan Lagasse

Abstract

Behavioral alignment in large language models (LLMs) is often achieved through broad fine-tuning, which can cause undesired side effects such as distributional shift and low interpretability. We propose an alignment method that identifies and updates only the neurons most responsible for a given behavior, a targeted approach that allows fine-tuning with significantly less data. Using sparse autoencoders (SAEs) and linear probes, we isolate the 3% of MLP neurons most predictive of a target behavior, decode them into residual space, and fine-tune only those neurons using gradient masking. We demonstrate this approach on the task of reducing sycophantic behavior, where our method matches or exceeds state-of-the-art performance on four benchmarks (Syco-Bench, NLP, POLI, PHIL) using Gemma-2-2B and Gemma-2-9B models. Our results show that sparse, neuron-level updates offer a scalable and precise alternative to full-model fine-tuning and remain effective even when little data is available. Code will be released upon acceptance.
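
The gradient-masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the SAE and the decoding into residual space, uses plain Python lists in place of model weights, and treats `probe_scores`, `top_fraction`, and `masked_sgd_step` as hypothetical names introduced only for illustration. It assumes each neuron has a scalar probe score and that the top 3% by absolute score are the ones to update.

```python
import random


def top_fraction(scores, fraction=0.03):
    """Return the indices of the highest-|score| entries (at least one)."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:k])


def masked_sgd_step(weights, grads, selected, lr=0.1):
    """Apply one SGD step to the rows (neurons) in `selected` only.

    Rows outside `selected` are copied unchanged, which is equivalent to
    multiplying their gradients by zero before the update.
    """
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in selected else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


random.seed(0)
n_neurons, d_in = 100, 4

# Hypothetical per-neuron probe scores (stand-in for linear-probe weights).
probe_scores = [random.gauss(0, 1) for _ in range(n_neurons)]

# Toy weight matrix and placeholder gradients, one row per neuron.
weights = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_neurons)]
grads = [[1.0] * d_in for _ in range(n_neurons)]

selected = top_fraction(probe_scores, fraction=0.03)
updated = masked_sgd_step(weights, grads, selected)

# Neurons whose weights actually changed after the masked step.
changed = {i for i in range(n_neurons) if updated[i] != weights[i]}
```

In this toy setting only the selected neurons change after the step, while every other row stays frozen. In a deep-learning framework the same effect is typically obtained by zeroing the gradient entries of unselected neurons before the optimizer update.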
