Large Vision Language Models as Algorithmic Reasoners for Multimodal Annotations
Shaina Raza · Mahveen Raza
Abstract
Large vision–language models (LVLMs) can function as algorithmic annotators by not only assigning labels to multimodal inputs but also generating structured reasoning traces that justify those labels. We introduce \textbf{Reasoning-as-Annotation (RaA)}, a paradigm in which an LVLM outputs a human-interpretable rationale, calibrated confidence, and evidence pointers alongside each label, effectively acting as both classifier and explainer. We evaluate RaA on bias detection in images using a curated dataset of $\sim$2,000 examples with human gold labels. Across closed- and open-source LVLMs, RaA preserves accuracy relative to black-box labeling while adding transparency: rationales were coherent and grounded in 75–90\% of cases, evidence pointers were auditable in 70–85\%, and confidence scores correlated with correctness ($r=0.60$–$0.76$). These results show that RaA is model-agnostic and maintains predictive quality while producing interpretable, auditable annotations. We position RaA as a scalable way to transform opaque labels into reusable reasoning traces for supervision and evaluation.
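To make the RaA output format concrete, the following is a minimal Python sketch of what an annotation record (label, rationale, calibrated confidence, evidence pointers) and the confidence–correctness correlation check might look like. The field names, class name, and helper function are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for a single RaA annotation; field names are illustrative.
@dataclass
class RaAAnnotation:
    label: str                   # predicted label, e.g. "biased" / "not_biased"
    rationale: str               # human-interpretable reasoning trace
    confidence: float            # calibrated confidence in [0, 1]
    evidence_pointers: List[str] = field(default_factory=list)  # e.g. image regions or caption spans


def confidence_correctness_correlation(annotations: List[RaAAnnotation],
                                        gold_labels: List[str]) -> float:
    """Pearson r between confidence and correctness (1.0 if the label matches gold, else 0.0)."""
    correct = [1.0 if a.label == g else 0.0 for a, g in zip(annotations, gold_labels)]
    conf = [a.confidence for a in annotations]
    n = len(conf)
    mean_c, mean_y = sum(conf) / n, sum(correct) / n
    cov = sum((c - mean_c) * (y - mean_y) for c, y in zip(conf, correct)) / n
    std_c = (sum((c - mean_c) ** 2 for c in conf) / n) ** 0.5
    std_y = (sum((y - mean_y) ** 2 for y in correct) / n) ** 0.5
    return cov / (std_c * std_y) if std_c and std_y else 0.0
```

In this sketch, correctness is reduced to a binary match against the human gold label, so the returned value corresponds to the confidence–correctness correlation the abstract reports ($r=0.60$–$0.76$ across models).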