Why RL Updates Look Sparse: An Implicit Compass Drives Optimization Bias
Abstract
Reinforcement learning (RL) reliably improves LLM reasoning while appearing to change only a small fraction of parameters. We revisit this paradox and argue that the visible sparsity is not the phenomenon itself but the trace of a persistent optimization bias: RL with verifiable rewards (RLVR) consistently commits updates to preferred regions that remain invariant across datasets and RL variants, as if guided by an implicit compass. We propose a Three-Gate Theory to formalize this mechanism. Gate I (KL Anchor) shows that RL induces a one-step policy-KL leash that keeps updates proximal to the base policy; Gate II (Model Geometry) steers this constrained update toward lower-curvature, spectrum-preserving directions, a data-invariant feature; and Gate III (Precision) filters the result, with the bfloat16 format acting as a lens that amplifies the bias by hiding micro-updates and rendering the underlying pattern as apparent sparsity. Empirically, we validate this theory with a comprehensive suite of experiments. We show that RL preserves the model's spectral structure and avoids its principal weights, in sharp contrast to SFT, which alters spectra and mainly targets those weights. Causal interventions confirm that the bias is destroyed when the model's geometry is disrupted, showing that geometry is the steering core of the "compass." By providing the first parameter-level account of RLVR's training dynamics, our work demystifies this optimization bias, offers a new perspective for understanding RLVR, and motivates the design of efficient RL training algorithms.
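As a rough illustration of the Gate III mechanism (a minimal sketch, not code from the paper; tensor size and update scale are arbitrary assumptions), the following PyTorch snippet shows how storing weights in bfloat16 can round away micro-updates, so a dense full-precision update looks sparse when before/after weights are compared in bf16.

```python
import torch

# Sketch of Gate III: bfloat16 storage hides micro-updates,
# making a dense update appear sparse at the parameter level.
torch.manual_seed(0)

w = torch.randn(1000, dtype=torch.float32)    # hypothetical weight vector
update = 1e-4 * torch.randn_like(w)           # small RL-style update

# Full-precision bookkeeping: essentially every parameter changes.
changed_fp32 = ((w + update) != w).float().mean().item()

# bfloat16 storage: apply the update, then round back to bf16.
w_bf16 = w.to(torch.bfloat16)
w_bf16_new = (w_bf16.float() + update).to(torch.bfloat16)
changed_bf16 = (w_bf16_new != w_bf16).float().mean().item()

print(f"fraction changed in fp32: {changed_fp32:.3f}")  # ~1.000
print(f"fraction changed in bf16: {changed_bf16:.3f}")  # far smaller: micro-updates rounded away
```

Because bf16 keeps only 8 mantissa bits, per-step changes much smaller than a weight's bf16 spacing are lost to rounding, which is how the precision gate turns a broadly distributed update into visibly sparse parameter diffs.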