Expo Talk Panel
Beyond Benchmarks: Rethinking Reasoning in Language Models
Iman Mirzadeh · Mehrdad Farajtabar
Upper Level Ballroom 20AB
Reasoning is often described as the next frontier for AI, but what does it really mean for a model to “reason”, and how should we measure it? Popular benchmarks like GSM8K suggest steady progress, yet controlled studies reveal that models can fail dramatically under small changes, such as swapping numbers or adding irrelevant details. Large Reasoning Models (LRMs), which generate explicit chains of thought, raise new optimism but also expose clear limits: they often underperform standard models on simple tasks, improve briefly at medium complexity, and then collapse on harder ones despite having unused compute. Crucially, reasoning is not the same as knowledge recall, tool use, or agent-like behavior. True reasoning involves solving novel problems, decomposing them into steps, generalizing to new contexts, recombining partial results, and finally generating novel hypotheses: capabilities current systems largely lack. Today’s evaluations, focused on final answers and contaminated benchmarks, risk giving a misleading sense of progress. This talk will provide a critical review of reasoning in language models, highlight why current evaluations can be deceptive, and emphasize that reasoning is not just about “what” models answer, but “how” they solve problems.
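To make the perturbation idea above concrete, here is a minimal sketch (not the speakers’ benchmark code) of evaluating a model on freshly instantiated variants of a GSM8K-style word problem, with the numbers re-sampled each time and an optional irrelevant detail appended. The template, the hypothetical ask_model callable, and the substring answer check are all illustrative assumptions.

# Illustrative sketch only: perturbation-style evaluation of a GSM8K-like problem.
# Numbers are re-sampled per variant and an irrelevant sentence can be appended;
# a robust reasoner's accuracy should not depend on either change.
import random
from typing import Callable

TEMPLATE = (
    "Liam picks {a} apples in the morning and {b} apples in the afternoon.{extra} "
    "How many apples does Liam pick in total?"
)
IRRELEVANT = " Five of the apples are slightly smaller than the rest."  # distractor clause


def make_variant(rng: random.Random, add_irrelevant: bool) -> tuple[str, int]:
    """Instantiate the template with fresh numbers; return (question, ground-truth answer)."""
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    extra = IRRELEVANT if add_irrelevant else ""
    return TEMPLATE.format(a=a, b=b, extra=extra), a + b


def accuracy(ask_model: Callable[[str], str], n: int = 50, add_irrelevant: bool = False) -> float:
    """Fraction of n perturbed variants whose ground-truth total appears in the model's reply."""
    rng = random.Random(0)
    correct = 0
    for _ in range(n):
        question, answer = make_variant(rng, add_irrelevant)
        correct += str(answer) in ask_model(question)  # naive answer check
    return correct / n


if __name__ == "__main__":
    # Oracle stand-in so the script runs offline; replace with a real LLM call
    # (any chat client wrapped as a question -> reply string function).
    oracle = lambda q: str(sum(int(tok) for tok in q.split() if tok.isdigit()))
    print("clean variants :", accuracy(oracle))
    print("with distractor:", accuracy(oracle, add_irrelevant=True))

Swapping a real model call in for the oracle stand-in and comparing the two reported accuracies indicates whether the model tracks the changed ground truth or merely pattern-matches the familiar surface form of the benchmark item.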