Poster

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

Xiang Li · Jian Ding · Mohamed Elhoseiny

West Ballroom A-D #5302
Thu 12 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-and-language datasets in remote sensing have been proposed in pursuit of this goal, they often have significant limitations: existing datasets are typically tailored to a single task, lack detailed object information, or suffer from inadequate quality control. To address these issues, we present VRSBench, a versatile vision-language benchmark for remote sensing image understanding. The benchmark comprises 29,614 images with 29,614 human-verified detailed captions, 52,472 object references, and 124,037 question-answer pairs, and it supports the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluate state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to contribute significantly to the development of advanced vision-language models in the field of remote sensing.