Winning Solution: Do You Know Your Dataset? Early Benchmarking of Scientific Competence in Small Language Models
Abstract
Evaluating small language models (SLMs) early in training is challenging because standard benchmarks provide weak or noisy signals before models acquire core semantic and reasoning abilities. This issue is particularly pronounced for scientifically oriented tasks, which often require multi-step reasoning. For the E2LM competition, we investigated which properties of multiple-choice question answering (MCQA) datasets improve the assessment of SLMs’ emerging scientific capabilities. Our work focuses on evaluating a wide range of existing benchmarks using the competition’s proposed methodology, and on constructing targeted dataset subsets and mixtures. Additionally, we introduce preliminary strategies for filtering datasets based on scientific compliance, including a conditional perplexity difference metric designed to identify scientifically discriminative samples. Finally, a newly created dataset of college-level exam questions highlights the limitations of SLMs when confronted with symbol-heavy, complex content.
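The abstract does not define the conditional perplexity difference metric; as a rough illustration only, the sketch below scores each candidate sample by how much its perplexity drops when conditioned on a science-oriented prompt, using an off-the-shelf causal language model. The scoring model (`gpt2`), the prompt text, and the ranking rule are assumptions made for this example, not the method used in the paper.

```python
# Hypothetical sketch of a conditional perplexity difference filter.
# Assumptions (not from the paper): the metric contrasts a sample's
# perplexity under a science-oriented context prompt with its
# unconditioned perplexity; model, prompt, and ranking are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder scoring model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

SCIENCE_PROMPT = "The following is a question from a science exam:\n"  # assumed prompt


@torch.no_grad()
def perplexity(text: str, context: str = "") -> float:
    """Perplexity of `text`, optionally conditioned on a preceding `context`."""
    txt_ids = tokenizer(text, return_tensors="pt").input_ids
    if context:
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, txt_ids], dim=1)
        # Mask context tokens so only the `text` tokens contribute to the loss.
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100
    else:
        input_ids = txt_ids
        labels = input_ids.clone()
    loss = model(input_ids, labels=labels).loss  # mean NLL over scored tokens
    return torch.exp(loss).item()


def conditional_ppl_difference(sample: str) -> float:
    """Unconditioned PPL minus science-conditioned PPL; larger values suggest
    the sample is more 'at home' in a scientific context."""
    return perplexity(sample) - perplexity(sample, context=SCIENCE_PROMPT)


if __name__ == "__main__":
    questions = [
        "What is the net force on an object moving at constant velocity?",
        "Which actor won the award for best supporting role in 1998?",
    ]
    # Keep the samples whose perplexity drops most under the scientific context.
    for q in sorted(questions, key=conditional_ppl_difference, reverse=True):
        print(f"{conditional_ppl_difference(q):+.3f}  {q}")
```

Under these assumptions, samples with a large positive difference are the ones a science-conditioned model finds markedly easier to predict, which is one plausible proxy for the "scientifically discriminative" samples mentioned above.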