Social

Evaluating Agentic Systems: Bridging Research Benchmarks and Real-World Impact

Tatia Tsmindashvili ⋅ Raphael Kalandadze

2025 Social

Abstract

Agentic AI systems (LLM-driven agents capable of autonomous planning, tool use, and multi-step task execution) are advancing rapidly, yet methods for evaluating them remain underdeveloped. Traditional metrics designed for static or single-turn tasks fail to capture the complexity of open-ended, long-horizon interactions in which goals evolve and behaviors emerge dynamically. This social aims to bridge research and industry perspectives on designing frameworks, simulation environments, and metrics that assess reliability, alignment, and safety in autonomous agents. Through lightning talks, panel discussions, and networking, the event fosters an interactive exchange on how to meaningfully evaluate and benchmark the next generation of agentic AI systems.
