Evaluating Spatial Reasoning in Language Models
Abstract
Existing reasoning benchmarks for language models (LMs) frequently fail to assess spatial reasoning adequately. In this work, we study spatial and topological reasoning by introducing a text-first benchmark built from Slitherlink and Nurikabe, two canonical grid-based constraint-satisfaction and connectivity puzzles. We generate the benchmark with a solver-aided framework that encodes puzzle constraints in Boolean form and samples solutions near-uniformly over a specified projection, yielding instance distributions that are diverse and minimally biased by handcrafted heuristics. Puzzle instances are represented in a custom coordinate-based domain-specific language (DSL), and model outputs are checked against the constraints by a rigorous validation engine. Baseline experiments show substantially higher accuracy on Nurikabe than on Slitherlink, with single-cycle loop topology emerging as the principal bottleneck; however, the results do not indicate robust competence in either puzzle family, showing that spatial reasoning remains an open challenge.
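
To make the generation step concrete, the following sketch draws one solution of a Boolean encoding, uniformly over a specified projection, by exhaustive model enumeration. It assumes the python-sat (pysat) package; the function name sample_solution and the toy formula are illustrative only. Exhaustive enumeration is exact-uniform but feasible only for small formulas; a scalable framework of the kind described above would typically rely on hashing-based near-uniform sampling instead.

    import random

    from pysat.formula import CNF
    from pysat.solvers import Glucose3

    def sample_solution(cnf, projection, seed=0):
        """Draw one assignment of the projection variables, chosen
        uniformly among the distinct projected solutions of `cnf`
        (found here by full enumeration)."""
        proj_vars = set(projection)
        projected = set()
        with Glucose3(bootstrap_with=cnf.clauses) as solver:
            while solver.solve():
                model = solver.get_model()
                # Restrict the model to the projection variables.
                proj = tuple(lit for lit in model if abs(lit) in proj_vars)
                projected.add(proj)
                # Block this projected assignment and keep enumerating.
                solver.add_clause([-lit for lit in proj])
        return list(random.Random(seed).choice(sorted(projected)))

    # Toy constraint set: exactly one of x1, x2, x3 is true.
    formula = CNF(from_clauses=[[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]])
    print(sample_solution(formula, projection=[1, 2, 3]))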
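
The coordinate-based DSL is not reproduced in this abstract; purely as a hypothetical illustration, a small Slitherlink instance in such a language might read as follows (every keyword and the clue layout are invented for exposition):

    puzzle slitherlink
    size 3 3
    # clue (row, col) k: exactly k of the cell's four edges lie on the loop
    clue (0, 0) 3
    clue (1, 2) 2
    clue (2, 1) 0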
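
The single-cycle bottleneck noted above corresponds to a simple graph-theoretic property: a valid Slitherlink loop is an edge set in which every touched vertex has degree exactly two and all edges lie on one connected cycle. The sketch below checks exactly that property; the edge representation and function name are assumptions for illustration, not the paper's validation engine.

    from collections import defaultdict

    def is_single_cycle(edges):
        """True iff `edges` forms one simple cycle: every touched vertex
        has degree exactly two and the edge set is connected.
        Edges are ((r1, c1), (r2, c2)) pairs stored in one canonical
        orientation (no duplicates with endpoints swapped)."""
        if not edges:
            return False
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        if any(len(neighbours) != 2 for neighbours in adj.values()):
            return False
        # Walk the cycle from an arbitrary vertex; a single simple cycle
        # visits every vertex exactly once before closing.
        start = next(iter(adj))
        prev, cur, steps = None, start, 0
        while True:
            steps += 1
            prev, cur = cur, next(n for n in adj[cur] if n != prev)
            if cur == start:
                break
        return steps == len(adj)

    # A unit square: four edges forming a single loop.
    square = {((0, 0), (0, 1)), ((0, 1), (1, 1)),
              ((1, 1), (1, 0)), ((1, 0), (0, 0))}
    print(is_single_cycle(square))  # True

An analogous flood-fill check covers Nurikabe's connectivity constraints (one connected wall region and correctly sized islands).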