Evaluating Spatial Reasoning in Language Models
Abstract
Existing reasoning benchmarks for language models (LMs) frequently fail to assess spatial reasoning adequately. In this work, we study spatial and topological reasoning by introducing a text-first benchmark built from Slitherlink and Nurikabe, two canonical grid-based constraint-satisfaction and connectivity puzzles. We generate the benchmark with a solver-aided framework that encodes puzzle constraints in Boolean form and samples solutions near-uniformly over a specified projection, yielding instance distributions that are diverse and minimally biased by handcrafted heuristics. Puzzle instances are represented in a custom coordinate-based domain-specific language (DSL), and model outputs are checked against the constraints by a rigorous validation engine. Baseline experiments show substantially higher accuracy on Nurikabe than on Slitherlink, with single-cycle loop topology emerging as the principal bottleneck; however, the results do not indicate robust competence in either puzzle family, showing that spatial reasoning remains an open challenge.
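
To make the generation step concrete, the following sketch draws one solution of a Boolean encoding, uniformly over a specified projection, by exhaustive model enumeration. It assumes the python-sat (pysat) package; the function name sample_solution and the toy formula are illustrative only. Exhaustive enumeration is exact-uniform but feasible only for small formulas; a scalable framework of the kind described above would typically rely on hashing-based near-uniform sampling instead.

    import random

    from pysat.formula import CNF
    from pysat.solvers import Glucose3

    def sample_solution(cnf, projection, seed=0):
        """Draw one assignment of the projection variables, chosen
        uniformly among the distinct projected solutions of `cnf`
        (found here by full enumeration)."""
        proj_vars = set(projection)
        projected = set()
        with Glucose3(bootstrap_with=cnf.clauses) as solver:
            while solver.solve():
                model = solver.get_model()
                # Restrict the model to the projection variables.
                proj = tuple(lit for lit in model if abs(lit) in proj_vars)
                projected.add(proj)
                # Block this projected assignment and keep enumerating.
                solver.add_clause([-lit for lit in proj])
        return list(random.Random(seed).choice(sorted(projected)))

    # Toy constraint set: exactly one of x1, x2, x3 is true.
    formula = CNF(from_clauses=[[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]])
    print(sample_solution(formula, projection=[1, 2, 3]))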
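
The coordinate-based DSL is not reproduced in this abstract; purely as a hypothetical illustration, a small Slitherlink instance in such a language might read as follows (every keyword and the clue layout are invented for exposition):

    puzzle slitherlink
    size 3 3
    # clue (row, col) k: exactly k of the cell's four edges lie on the loop
    clue (0, 0) 3
    clue (1, 2) 2
    clue (2, 1) 0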
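
The single-cycle bottleneck noted above corresponds to a simple graph-theoretic property: a valid Slitherlink loop is an edge set in which every touched vertex has degree exactly two and all edges lie on one connected cycle. The sketch below checks exactly that property; the edge representation and function name are assumptions for illustration, not the paper's validation engine.

    from collections import defaultdict

    def is_single_cycle(edges):
        """True iff `edges` forms one simple cycle: every touched vertex
        has degree exactly two and the edge set is connected.
        Edges are ((r1, c1), (r2, c2)) pairs stored in one canonical
        orientation (no duplicates with endpoints swapped)."""
        if not edges:
            return False
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        if any(len(neighbours) != 2 for neighbours in adj.values()):
            return False
        # Walk the cycle from an arbitrary vertex; a single simple cycle
        # visits every vertex exactly once before closing.
        start = next(iter(adj))
        prev, cur, steps = None, start, 0
        while True:
            steps += 1
            prev, cur = cur, next(n for n in adj[cur] if n != prev)
            if cur == start:
                break
        return steps == len(adj)

    # A unit square: four edges forming a single loop.
    square = {((0, 0), (0, 1)), ((0, 1), (1, 1)),
              ((1, 1), (1, 0)), ((1, 0), (0, 0))}
    print(is_single_cycle(square))  # True

An analogous flood-fill check covers Nurikabe's connectivity constraints (one connected wall region and correctly sized islands).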