A Time-Series Vision–Language Model for Predicting Progression of Diabetic Retinopathy
Abstract
Early detection of diabetic retinopathy (DR) progression is critical for timely intervention and prevention of vision loss. We present a time-series vision–language model that integrates longitudinal clinical context with retinal fundus images to forecast progression to referable DR at 1-, 2-, and 3-year horizons. The framework aligns fundus photographs with structured narrative prompts that encode demographics, diabetes history, and prior screening outcomes. Training is formulated as a contrastive objective, encouraging image embeddings to align with the correct horizon-specific outcome hypothesis. Using a national screening dataset of more than one million visits, we show that incorporating longitudinal information into the prompts consistently improves predictive performance, with the best one-year configuration achieving an AUROC of 0.707. The approach offers two key advantages: interpretability, by conditioning predictions on explicit clinical narratives, and extensibility, by allowing prompts to be adapted or enriched with additional timepoint information. To our knowledge, this is the first vision–language framework for horizon-specific DR forecasting, establishing a simple and reproducible baseline for adaptive recall scheduling, triage, and population-level risk management in DR screening programmes.
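The contrastive objective described in the abstract can be illustrated with a minimal InfoNCE-style sketch: each image embedding in a batch should score highest against its own paired prompt embedding. The function name, temperature value, and embedding dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: each image should match its paired
    horizon-specific prompt embedding (the diagonal of the
    image-text similarity matrix). Temperature is an assumed value."""
    # L2-normalise so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (batch, batch) similarities
    # Softmax cross-entropy with targets on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy example: 4 hypothetical fundus-image embeddings and prompt embeddings.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(4, 16))
loss_mismatched = contrastive_alignment_loss(imgs, rng.normal(size=(4, 16)))
loss_matched = contrastive_alignment_loss(imgs, imgs)  # perfectly aligned pairs
print(loss_matched < loss_mismatched)  # aligned pairs yield the lower loss
```

In the paper's setting, the text side would be a prompt encoding demographics, diabetes history, and prior screening outcomes for a given horizon, so minimising this loss pulls each image toward the prompt stating the correct horizon-specific outcome.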