Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact
Abstract
Agentic AI systems - LLM-driven agents capable of autonomous planning, tool use, and multi-step task execution - are rapidly advancing, yet methods for evaluating them remain underdeveloped. Traditional metrics designed for static or single-turn tasks fail to capture the complexity of open-ended, long-horizon interactions, where goals evolve and behaviors emerge dynamically. This social event aims to bridge research and industry perspectives on designing frameworks, simulation environments, and metrics that assess the reliability, alignment, and safety of autonomous agents. Through lightning talks, panel discussions, and networking, the event will foster an interactive exchange on how to meaningfully evaluate and benchmark the next generation of agentic AI systems.