Expo Talk Panel
Juries, Not Judges! Industry-Scale Evaluation of Trustworthy AI via Dynamic LLM Panels
Freddy Lecue
Upper Level Ballroom 20AB
As Large Language Models (LLMs) become central to high-stakes applications, the reliability of their evaluation systems is under intense scrutiny, especially in the financial industry. Traditional approaches such as human annotation, single LLM judges, and static model juries struggle to balance scalability, cost, and trustworthiness. We will present LLM Jury-on-Demand, a dynamic, learning-based framework that assembles an optimal panel of LLM evaluators for each task instance, using predictive modeling to select and weight judges according to their context-specific reliability. The system adapts in real time and outperforms static ensembles and single judges in alignment with human expert judgment on summarization and retrieval-augmented generation benchmarks. This talk will showcase how adaptive LLM juries can transform the evaluation of AI systems, offering robust, scalable, and context-aware solutions for industry and research. Attendees will gain practical insights into building trustworthy LLM evaluation pipelines, see live demos, and discuss future directions for reliable AI assessment in critical domains.
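To make the selection-and-weighting step concrete, here is a minimal Python sketch of the general idea: predict a context-specific reliability score for each candidate judge on a given task instance, keep the most reliable judges, and aggregate their ratings with reliability weights. This is an illustrative assumption of how such a pipeline could look, not the speakers' actual implementation; all names (`Judge`, `assemble_jury`, `predict_reliability`, the toy reliability scores) are hypothetical.

```python
# Sketch of a "jury-on-demand" evaluation step (illustrative, not the talk's code).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Judge:
    name: str
    rate: Callable[[str], float]  # returns a score in [0, 1] for one task instance


def assemble_jury(
    task: str,
    judges: List[Judge],
    predict_reliability: Callable[[str, Judge], float],
    k: int = 3,
) -> List[Tuple[Judge, float]]:
    """Pick the k judges predicted to be most reliable for this task instance."""
    scored = [(j, predict_reliability(task, j)) for j in judges]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def jury_verdict(task: str, jury: List[Tuple[Judge, float]]) -> float:
    """Reliability-weighted average of the selected judges' ratings."""
    total_weight = sum(w for _, w in jury)
    return sum(w * judge.rate(task) for judge, w in jury) / total_weight


if __name__ == "__main__":
    # Toy stand-ins: in practice each judge would be an LLM call and the
    # reliability predictor a learned model over task and judge features.
    judges = [
        Judge("concise-judge", lambda t: 0.8),
        Judge("faithfulness-judge", lambda t: 0.6),
        Judge("fluency-judge", lambda t: 0.9),
    ]
    toy_reliability = {
        "concise-judge": 0.7,
        "faithfulness-judge": 0.9,
        "fluency-judge": 0.4,
    }
    task = "Summarize the quarterly earnings call."
    jury = assemble_jury(
        task,
        judges,
        predict_reliability=lambda t, j: toy_reliability[j.name],
        k=2,
    )
    print(round(jury_verdict(task, jury), 3))
```

In a real pipeline the reliability predictor would be trained against human expert labels, which is what lets the jury composition adapt per instance rather than relying on a fixed ensemble.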