Decomposing Reasoning Efficiency in Large Language Models
Abstract
While large language models (LLMs) are typically evaluated on accuracy alone, their deployment requires careful attention to computational efficiency. Building on the CogniLoad benchmark, we introduce a unified efficiency metric, correct answers per 1,000 output tokens, that enables direct cross-model comparison. We further propose an exact, interpretable decomposition of this metric into context robustness, logic robustness, and token appetite. Evaluating 15 state-of-the-art reasoning models on CogniLoad, we find that models arrive at their token efficiency in different ways: some are prone to logic errors, others spend tokens on verbose but correct solutions, and a few fail by exhausting the context limit. By turning overall efficiency into actionable components, our framework provides concrete targets for improving LLM reasoning efficiency.
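As an illustrative sketch only (the paper's exact formulation may differ), one exact multiplicative factorization consistent with the three named components is the following: let $N$ be the number of tasks, $N_{\text{fit}}$ the number whose responses stay within the context limit, $C$ the number of correct answers, and $T$ the total output tokens. Then

$$
E \;=\; \frac{1000\,C}{T}
\;=\; 1000 \cdot \underbrace{\frac{N_{\text{fit}}}{N}}_{\text{context robustness}} \cdot \underbrace{\frac{C}{N_{\text{fit}}}}_{\text{logic robustness}} \cdot \underbrace{\frac{N}{T}}_{\text{inverse token appetite}},
$$

where the token appetite $T/N$ is the average output length per task, so a larger appetite lowers efficiency while the two robustness terms raise it.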