Demystify the Potential of Large Language Models as World Models of Code
Bohan Lyu · Siqiao Huang · Zichen Liang · Wenjia Yang · Qian Sun · Jiaming Zhang
Abstract
A key frontier for Large Language Models (LLMs) is the development of internal world models that simulate and reason about the dynamics of an environment, a capability foundational to their evolution from text generators into systems with deeper reasoning abilities. Code, with its structured logic and deterministic execution, serves as an ideal domain in which to study and cultivate such models. This work investigates a fundamental question: Can LLMs serve as world models of code, predicting program outcomes without actual execution? To probe this potential systematically, we introduce **WoC**, a holistic benchmark with $1160$ problems covering $8$ key aspects: multi-language programming, competition-level problems, repository-level analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, environment-dependent programs, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings offer insights into the capabilities, limitations, and scaling properties of LLMs as world models of code, a critical step towards building more general and robust computational reasoning systems.
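The evaluation task the abstract describes, asking a model to predict a program's output and checking the prediction against real execution, can be sketched in a few lines. The sketch below is a minimal illustration under our own assumptions: the names `run_program`, `predict_output`, and `score` are hypothetical, and exact-match scoring is a simplifying choice, not necessarily the paper's released harness.

```python
import subprocess
import sys


def run_program(source: str, timeout: float = 5.0) -> str:
    """Execute a Python snippet and capture its stdout (the ground truth)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout.strip()


def predict_output(source: str) -> str:
    """Placeholder for the LLM call: the model must predict stdout
    WITHOUT executing the code. A real harness would prompt an LLM here."""
    # Hypothetical stub so the sketch runs end to end.
    return "6"


def score(problems: list[str]) -> float:
    """Fraction of problems where the model's predicted output exactly
    matches the output produced by actual execution."""
    correct = sum(predict_output(src) == run_program(src) for src in problems)
    return correct / len(problems)


if __name__ == "__main__":
    demo = ["print(sum(i for i in range(4)))"]  # actually prints 6
    print(f"accuracy: {score(demo):.2f}")
```

Exact string matching is the simplest scoring rule; a production benchmark covering aspects such as environment-dependent programs or buggy code would likely need output normalization, per-category timeouts, and sandboxed execution.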