Auto-Enhance: Towards a Meta-Benchmark to Evaluate AI Agents' Ability to Improve Other Agents
Abstract
LLM agents are rapidly improving at benchmarks designed to measure software development and reasoning ability. As the creation and improvement of such systems is itself a software development task, we are interested in this specific subset of ability. We describe the first steps towards a "meta-benchmark", in which a "top-level" agent aims to increase the performance of "reference agents" at tasks taken from a range of other benchmarks in the literature. We show that agents are able to alter other systems to a) better withstand prompt injection, b) fix non-trivial bugs in scaffolds, c) compare the performance of various LLMs at ``operating'' reference agents' scaffolds, and d) make progress towards model-editing tasks such as unlearning. We consider this a first step towards a broad and extensible meta-benchmark, which would allow us to measure AI self-improvement abilities.