WearVQA: A Visual Question Answering Benchmark for Wearables in Authentic Egocentric Real-World Scenarios
Abstract
We introduce WearVQA, the first benchmark specifically designed to evaluate the visual question answering (VQA) capabilities of multi-modal AI assistants on wearable devices such as smart glasses. Unlike prior benchmarks that focus on high-quality, third-person imagery, WearVQA reflects the unique challenges of egocentric interaction, where visual inputs may be occluded, poorly lit, unzoomed, or blurry, and questions are grounded in realistic wearable use cases. The benchmark comprises 2,500 carefully curated image-question-answer triplets spanning 7 diverse image domains, including both text-centric and general scenes, 10 cognitive task types ranging from basic recognition to various forms of reasoning, and 6 common wearables-specific image quality issues. All questions are designed to be answerable using only the visual input and common sense. WearVQA is paired with a rigorous LLM-as-a-judge evaluation framework with 96% labeling accuracy. Open-source and proprietary multi-modal LLMs achieve QA accuracy of only 24–52% on WearVQA, with substantial drops on lower-quality images and reasoning-heavy tasks. These observations position WearVQA as a comprehensive and challenging benchmark for guiding technical advancement toward robust, real-world multi-modal wearable AI systems.