Closed-Task Validation: A More Robust and Efficient Proxy for Guiding VLM Training
Abstract
Reliable and efficient validation is critical for guiding the resource-intensive process of training Vision-Language Models (VLMs). However, the standard evaluation paradigm, which relies on open-ended text generation, exhibits significant methodological limitations. We empirically demonstrate that this approach is unreliable, yielding high-variance metrics with a negligible correlation (r = 0.061) to final model performance. Furthermore, it is inefficient: auto-regressive decoding introduces substantial latency and severe load-balancing issues in parallel evaluation. To address these limitations, we propose "Closed-Task" validation, a paradigm that bypasses auto-regressive decoding by converting questions into a multiple-choice format and directly inspecting the token probabilities the model assigns to each answer option. Our experiments show this method is both highly reliable, producing stable signals strongly correlated (r = 0.798) with final performance, and efficient, achieving a >10x latency reduction with near-perfect load balancing. This work thus provides a validation methodology that jointly resolves the interconnected challenges of evaluation reliability and system efficiency, offering a superior empirical framework for VLM development.
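To make the core mechanism concrete, the sketch below illustrates one way closed-task validation can be implemented with a Hugging Face causal model: the question and its options are formatted as a multiple-choice prompt, a single forward pass yields next-token logits, and the predicted answer is the option letter with the highest probability. This is a minimal illustration, not the paper's reference implementation; the model name "gpt2" is a stand-in for the VLM under validation, the prompt template and the score_options helper are hypothetical, and it assumes each option letter maps to a single token.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the VLM checkpoint being validated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_options(question: str, options: dict[str, str]) -> str:
    """Pick an answer by comparing next-token probabilities of the option
    letters after a multiple-choice prompt -- one forward pass, no
    auto-regressive decoding."""
    prompt = (
        question
        + "\n"
        + "\n".join(f"{letter}. {text}" for letter, text in options.items())
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    # Assumes " A", " B", ... each tokenize to a single token (true for
    # GPT-2-style BPE vocabularies; verify for the tokenizer in use).
    letter_ids = {
        letter: tokenizer.encode(" " + letter, add_special_tokens=False)[0]
        for letter in options
    }
    return max(letter_ids, key=lambda l: next_token_probs[letter_ids[l]].item())

predicted = score_options(
    "What color is the sky on a clear day?",
    {"A": "green", "B": "blue", "C": "red"},
)
print(predicted)  # expected: "B"

Because the prediction reduces to reading a fixed number of logits from one forward pass, every validation example has near-identical cost, which is what yields the uniform per-example latency and near-perfect load balancing described above.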