

Capital One

Expo Workshop

Exploring Trust and Reliability in LLM Evaluation

Shixiong Zhang · Sambit Sahu · Milind Naphade · Jordan Lacey

Upper Level Room 30A-E
Tue 2 Dec, noon to 1:30 p.m. PST

Abstract:

The current paradigm of Large Language Model (LLM) evaluation faces a crisis of reliability. Traditional leaderboards—built on static benchmarks and surface-level metrics—have become increasingly distorted by benchmark contamination, prompt overfitting, and evaluation methodologies that fail to reflect model behavior in real-world use. As reasoning models emerge that generate detailed internal thought processes (i.e., reasoning traces) before producing answers, existing evaluation practices—especially for multiple-choice and generation tasks—have become fundamentally inadequate.

This lack of rigor not only undermines scientific progress and cross-model comparability, but also poses significant enterprise and societal risks, as evaluation results inform model selection, deployment safety, and governance in high-stakes environments.

This workshop aims to reassert rigor in LLM evaluation by convening researchers and practitioners to address three intertwined challenges: (1) developing fair and consistent evaluation methods for reasoning and non-reasoning models, (2) confronting widespread contamination across public benchmarks and open-weight models, and (3) defining robust data curation and validation practices to prevent future contamination in both pretraining and post-training pipelines.

By combining empirical findings, methodological advances, and practical case studies, this session—led by Capital One in collaboration with leading AI labs—seeks to chart a concrete path toward trustworthy, contamination-proof, and utility-aligned LLM evaluation frameworks.

This 1.5-hour workshop will be structured around three highly focused, 25-minute talks, followed by a moderated discussion aimed at forging actionable paths forward for the community:

Talk 1: Robust Evaluation for Reasoning & Non-Reasoning Models

Talk 2: Benchmark Contamination — Detection, Measurement, & Findings

Talk 3: Preventing Contamination — Building Clean & Reliable Data Pipelines
