SWE-InfraBench: Evaluating Language Models on Cloud Infrastructure Code
Natalia Tarasova ⋅ Enrique Balp-Straffon ⋅ Aleksei Iancheruk ⋅ Yevhenii Sielskyi ⋅ Nikita Kozodoi ⋅ Liam Byrne ⋅ Jack Butler ⋅ Dayuan Jiang ⋅ Marcin Czelej ⋅ Andrew Ang ⋅ Yash Shah ⋅ Roi Blanco ⋅ Sergey Ivanov
Abstract
Infrastructure-as-code (IaC) is critical for cloud reliability and scalability, yet LLM capabilities in this domain remain underexplored. Existing benchmarks focus on declarative tools like Terraform and on full-code generation. We introduce SWE-InfraBench, a dataset of realistic incremental edits to AWS CDK repositories drawn from real-world codebases. Each task requires modifying existing IaC based on natural language instructions, with correctness verified by passing tests. Results show current LLMs struggle: the best model (Sonnet 3.7) solves 34% of tasks, while reasoning models like DeepSeek R1 reach only 24%.