Poster
in
Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

Rethinking MCQ Benchmarks: Mandatory Reasoning Evaluation Reveals Significant Performance Drops in Large Language Models

Yue Zhang ⋅ Nhan Nguyen

Project Page [ Poster] [ OpenReview]

Abstract

Rigorous evaluation of Large Language Models (LLMs) is critical for their adoption in high-stakes applications, particularly in highly technical domains that require deep expertise and specialized training. The proliferation of LLMs from various providers further underscores the need for comprehensive model performance benchmarking. Like many standardized tests and certification exams, several prominent LLM benchmark datasets employ the Multiple Choice Questions (MCQs) format, such as GPQA and MMLU. Despite its convenience, prior research has demonstrated that MCQs oversimplify assessment and may not accurately reflect true model capabilities. Traditional MCQ evaluation only assesses whether the final answer is correct, potentially allowing models to succeed through lucky guessing or flawed reasoning. In the current work, we designed and performed a two-stage evaluation framework that mandates LLMs not only select the correct answer but also provide reasoning or derivation processes and justification for each options. We evaluated 21 LLMs across three model families on two datasets: GPQA and a custom dataset of 500 AWS-related questions curated from public online materials. Our results show that LLMs consistently struggle when both answer accuracy and reasoning quality are required, with performance scores dropping significantly across all models by 10\% to 45\% absolute points compared to conventional final answer-only evaluation. Based on these findings, we recommend that all MCQ benchmarks incorporate explanation evaluation as a standard component.

Chat is not available.