Why RL Updates Look Sparse: An Implicit Compass Drives Optimization Bias
Abstract
Reinforcement learning (RL) reliably improves LLM reasoning while appearing to change only a small fraction of parameters. We revisit this paradox and argue that the visible sparsity is not the phenomenon itself but the trace of a persistent optimization bias: RL with verifiable rewards (RLVR) consistently commits updates to preferred regions that remain invariant across datasets and RL variants, as if guided by an implicit compass. We propose a Three-Gate Theory to formalize this mechanism. Gate I (KL Anchor) shows that RL induces a one-step policy-KL leash that keeps updates proximal to the base policy; Gate II (Model Geometry) steers this constrained update toward lower-curvature, spectrum-preserving directions, a data-invariant feature; and Gate III (Precision) filters the result, with the bfloat16 format acting as a lens that amplifies the bias by hiding micro-updates and rendering the underlying pattern as apparent sparsity. Empirically, we validate this theory with a comprehensive suite of experiments. We show that RL preserves the model's spectral structure and avoids its principal weights, in sharp contrast to SFT, which alters spectra and mainly targets those weights. Causal interventions confirm that the bias is destroyed when the model's geometry is disrupted, showing that geometry is the steering core of the "compass." By providing the first parameter-level account of RLVR's training dynamics, our work demystifies this optimization bias, offers a new perspective for understanding RLVR, and motivates the design of efficient RL training algorithms.
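As a rough illustration of the Gate III mechanism (a minimal sketch, not code from the paper; tensor size and update scale are arbitrary assumptions), the following PyTorch snippet shows how storing weights in bfloat16 can round away micro-updates, so a dense full-precision update looks sparse when before/after weights are compared in bf16.

```python
import torch

# Sketch of Gate III: bfloat16 storage hides micro-updates,
# making a dense update appear sparse at the parameter level.
torch.manual_seed(0)

w = torch.randn(1000, dtype=torch.float32)    # hypothetical weight vector
update = 1e-4 * torch.randn_like(w)           # small RL-style update

# Full-precision bookkeeping: essentially every parameter changes.
changed_fp32 = ((w + update) != w).float().mean().item()

# bfloat16 storage: apply the update, then round back to bf16.
w_bf16 = w.to(torch.bfloat16)
w_bf16_new = (w_bf16.float() + update).to(torch.bfloat16)
changed_bf16 = (w_bf16_new != w_bf16).float().mean().item()

print(f"fraction changed in fp32: {changed_fp32:.3f}")  # ~1.000
print(f"fraction changed in bf16: {changed_bf16:.3f}")  # far smaller: micro-updates rounded away
```

Because bf16 keeps only 8 mantissa bits, per-step changes much smaller than a weight's bf16 spacing are lost to rounding, which is how the precision gate turns a broadly distributed update into visibly sparse parameter diffs.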