Best student solution: Early-Stage LLM Evaluation: A Data-Driven Filtering Pipeline for MMLU-var
Abstract
Evaluating Large Language Models (LLMs) during early training presents a unique challenge, as standard benchmarks often yield noisy signals that obscure true learning progress. We present a data-driven approach to constructing a high-quality benchmark derived from MMLU-var. Our methodology prioritized building a comprehensive performance database first, by evaluating the competition models across all training checkpoints, which enabled rapid iteration on filtering strategies later. We implemented a multi-stage pipeline that selects questions based on strict scientific compliance and signal quality. For compliance, we automatically retained core "hard science" subjects (e.g., Physics, Math) while subjecting others (e.g., Biology, Medicine) to a rigorous LLM-as-a-Judge process, in which each question was submitted to the judge five times for consistency and retained only if it received five "Accept" verdicts. To optimize for signal quality, we fit a linear regression to each question's Confidence Margin scores across training checkpoints, retaining only questions with a positive learning trend. We demonstrate that the Confidence Margin provides a significantly smoother and more monotonic signal than standard accuracy, effectively capturing the steady accumulation of knowledge in early-stage models.
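As a minimal sketch of the trend-based filter described above, the snippet below fits a per-question linear regression to Confidence Margin scores ordered by training checkpoint and keeps only questions with a positive slope. The function name, data layout, and margin values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def keep_positive_trend(margins_by_question: dict[str, list[float]]) -> set[str]:
    """Retain questions whose Confidence Margin shows a positive learning trend.

    margins_by_question maps a question ID to its Confidence Margin score at
    each training checkpoint, ordered from earliest to latest checkpoint.
    (Names and data layout are hypothetical, for illustration only.)
    """
    kept = set()
    for qid, margins in margins_by_question.items():
        x = np.arange(len(margins))                 # checkpoint index as the regressor
        slope, _intercept = np.polyfit(x, margins, deg=1)  # degree-1 (linear) fit
        if slope > 0:                               # positive trend -> question is retained
            kept.add(qid)
    return kept


# Example usage with made-up margins for two questions across four checkpoints:
# q1 trends upward and is kept; q2 trends downward and is dropped.
print(keep_positive_trend({"q1": [0.05, 0.10, 0.18, 0.25],
                           "q2": [0.30, 0.22, 0.20, 0.15]}))
```

A simple least-squares slope is used here as the trend statistic because the abstract specifies a linear regression over checkpoints; any equivalent fitting routine would serve the same purpose.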