Efficient Reasoning at Fixed Test-Time Cost via Length-Aware Attention Priors and Gain-Aware Training
Abstract
We frame efficient reasoning as making structured decisions under tight compute without increasing test-time cost. Concretely, we propose two training-time-only additions to small/medium Transformers: (i) a length-aware attention prior, built by fuzzy regime–position alignment (RPA), that supplies a normalized pre-softmax bias, and (ii) a minimal gain-aware controller that sharpens attention only when validation utility warrants it; both leave inference unchanged. A KL-regularized MAP view, via softmax(z + log π), explains when such priors help and how they act as principled regularizers. Under strict compute parity on WikiText-2, the recipe lowers validation cross-entropy while matching baseline latency and memory, and we include diagnostics for long-span linkage and length generalization. The approach targets reasoning-flavored workloads (retrieval, long-span linkage, routing) and generalizes as a structured-prior-plus-slow-controller pattern for training regimes where late-phase optimization gains are scarce.
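The softmax(z + log π) view above can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: the RPA construction of the prior π is not specified here, so the length-aware prior is a hypothetical distribution over key positions; the point is only that adding log π before the softmax reweights attention toward the prior, and that a uniform prior leaves attention unchanged.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_biased_attention(z, prior):
    # Pre-softmax bias: softmax(z + log pi) reweights the
    # attention distribution toward the prior pi (MAP-style).
    return softmax(z + np.log(prior))

# Toy example: one query attending over 4 key positions.
z = np.array([1.0, 0.5, 0.2, -0.3])          # raw attention logits
uniform = np.full(4, 0.25)                    # uniform prior: constant shift, no effect
near_prior = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical length-aware prior favoring nearby keys

w_plain = softmax(z)
w_biased = prior_biased_attention(z, near_prior)
```

With the uniform prior, `log(0.25)` is a constant added to every logit, so the softmax output is identical to the unbiased attention; with the length-aware prior, mass shifts toward the favored positions while the output remains a valid distribution.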