MedVAL: Toward Expert-Level Medical Text Validation with Language Models
Asad Aali ⋅ Vasiliki Bikia ⋅ Maya Varma ⋅ Nicole Chiou ⋅ Sophie Ostmeier ⋅ Arnav Singhvi ⋅ Magdalini Paschali ⋅ Ashwin Kumar ⋅ Andrew Johnston ⋅ Karimar Amador-Martinez ⋅ Eduardo Guerrero ⋅ Paola Rivera ⋅ Sergios Gatidis ⋅ Christian Bluethgen ⋅ Eduardo Reis ⋅ Eddy van Rilland ⋅ Poonam Hosamani ⋅ Kevin Keet ⋅ Minjoung Go ⋅ Evelyn Ling ⋅ David Larson ⋅ Curtis Langlotz ⋅ Roxana Daneshjou ⋅ Jason Hom ⋅ Sanmi Koyejo ⋅ Emily Alsentzer ⋅ Akshay Chaudhari
Abstract
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges, including a multilingual task reviewed by bilingual physicians. Each output is reviewed following a physician-defined taxonomy of risk levels and error categories, enabling evaluation of LMs in making safety decisions for deployment. Across 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves ($p < 0.001$) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.