Test-Time Scaling for Multistep Reasoning in Small Language Models via A* Search
Abstract
Large language models (LLMs) have demonstrated strong abilities across a wide range of tasks but are costly in computation and memory. Small language models (SLMs), in contrast, offer significant advantages in efficiency and deployability but typically struggle with complex mathematical reasoning. To address this gap, we present Test-Time A* Search (TTA*), a test-time scaling framework that casts reasoning as goal-directed search over a tree of partial solutions. TTA* is training-free and requires no external supervision or multi-model architecture, making it practical in resource-constrained settings. Acting as a drop-in decoding wrapper for SLMs, TTA* systematically explores, critiques, and refines candidate solution paths using the model's own self-reflection capability. Extensive experiments on popular mathematical reasoning benchmarks with a variety of base models show that TTA* consistently improves accuracy and robustness, indicating broad applicability to general mathematical reasoning tasks.
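To make the abstract's framing concrete, the sketch below illustrates what A*-style best-first search over a tree of partial solutions could look like in code. It is a minimal sketch under stated assumptions, not the paper's implementation: the callbacks `expand` (propose candidate next reasoning steps), `score` (a self-evaluation playing the role of A*'s f = g + h, lower is better), and `is_goal` (detect a complete solution), along with `beam_width` and `max_expansions`, are hypothetical names introduced here for illustration.

```python
import heapq
import itertools


def test_time_astar(expand, score, is_goal, root,
                    max_expansions=100, beam_width=4):
    """Best-first (A*-style) search over a tree of partial solutions.

    expand(state)  -> list of child states (candidate next reasoning steps)
    score(state)   -> estimated total cost f = g + h; lower is better
    is_goal(state) -> True if the state is a complete solution
    """
    counter = itertools.count()  # tie-breaker so the heap never compares states
    frontier = [(score(root), next(counter), root)]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)  # expand the most promising node
        if is_goal(state):
            return state
        for child in expand(state)[:beam_width]:  # cap branching per node
            heapq.heappush(frontier, (score(child), next(counter), child))
    return None  # search budget exhausted without a complete solution
```

In such a setup, the priority function is where the SLM's self-reflection would enter: the model critiques each partial path and the resulting score determines which branch the search explores or refines next.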