SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
Natalia Tarasova ⋅ Enrique Balp-Straffon ⋅ Aleksei Iancheruk ⋅ Yevhenii Sielskyi ⋅ Nikita Kozodoi ⋅ Liam Byrne ⋅ Jack Butler ⋅ Dayuan Jiang ⋅ Marcin Czelej ⋅ Andrew Ang ⋅ Yash Shah ⋅ Roi Blanco ⋅ Sergey Ivanov
Abstract
Infrastructure-as-code (IaC) is critical for cloud reliability and scalability, yet LLM capabilities in this domain remain underexplored. Existing benchmarks focus on declarative tools like Terraform and on full-code generation. We introduce SWE-InfraBench, a dataset of realistic incremental edits to AWS CDK repositories drawn from real-world codebases. Each task requires modifying existing IaC based on natural language instructions, with correctness verified by passing tests. Results show current LLMs struggle: the best model (Sonnet 3.7) solves 34% of tasks, while reasoning models like DeepSeek R1 reach only 24%.