Weak Discriminative Verification Enables Strong Test-time Scaling
Abstract
Test-time scaling has become a popular strategy to boost large language model performance on complex reasoning tasks. A standard approach involves sampling multiple candidate solutions, then selecting the final answer via self-consistency or a verifier model. While generative verifiers can outperform self-consistency, they incur substantial overhead due to expensive chain-of-thought generation, and often yield limited gains under practical budgets. Discriminative verifiers, by contrast, are far more efficient but typically underperform self-consistency when the pool of candidate solutions grows large. In this work, we show that a weak discriminative verifier can be transformed into a strong yet efficient test-time scaler. Specifically, by pairing a lightweight discriminative verifier with a simple pessimism penalty that down-weights low-support answers, our method consistently outperforms self-consistency with minimal overhead in verification compute. On AIME2024, DeepSeek-R1-Distill-Qwen-32B paired with our method improves from 68.2\% to 79.7\% with just 4 candidate solutions -- matching the performance of o3-mini (medium) and outperforming self-consistency by 2.2\% for only 0.5\% additional compute. Our results suggest that lightweight discriminative verification with pessimistic scoring offers a practical and efficient solution to test-time scaling. Code is available at https://anonymous.4open.science/r/DPV-NeurIPS2025.
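To make the pessimistic-scoring idea concrete, here is a minimal sketch of answer selection with a support-dependent penalty. It assumes the discriminative verifier returns a score in [0, 1] per candidate solution; the specific penalty form `alpha / sqrt(n)` and the hyperparameter `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict
import math

def select_answer(candidates, verifier_scores, alpha=0.5):
    """Pick a final answer from sampled candidate solutions.

    candidates: list of final answers, one per sampled solution.
    verifier_scores: per-candidate scores in [0, 1] from a
        discriminative verifier (assumed interface).
    alpha: strength of the pessimism penalty (hypothetical
        hyperparameter; the actual penalty form may differ).
    """
    # Group verifier scores by the final answer they support.
    groups = defaultdict(list)
    for ans, score in zip(candidates, verifier_scores):
        groups[ans].append(score)

    def pessimistic_score(scores):
        # Mean verifier score minus a penalty that shrinks as the
        # support (number of candidates giving this answer) grows,
        # so low-support answers are down-weighted.
        n = len(scores)
        return sum(scores) / n - alpha / math.sqrt(n)

    return max(groups, key=lambda a: pessimistic_score(groups[a]))
```

With this scoring, a well-supported answer with moderate verifier scores can beat a lone candidate with a slightly higher score, while a lone candidate the verifier strongly prefers can still win -- combining the strengths of self-consistency and verification.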