TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Hsin Yi Hsieh · Shang-Wei Liu · Chang-Chih Meng · Chien-Hua Chen · Shuo-Yueh Lin · Hung-Ju Lin · Hen-Hsen Huang · I-Chen Wu
Abstract
Vision-language models (VLMs) often struggle with culturally specific content, a challenge largely overlooked by existing benchmarks that focus on dominant languages and globalized datasets. We introduce TaiwanVQA, a VQA benchmark designed around Taiwanese culture to evaluate recognition and reasoning in regional contexts. TaiwanVQA contains 2,736 images and 5,472 manually curated questions covering topics such as traditional foods, public signs, festivals, and landmarks. The official benchmark set includes 1,000 images and 2,000 questions for systematic assessment, with the remaining data used as training material. Evaluations of state-of-the-art VLMs reveal strong visual recognition but notable weaknesses in cultural reasoning. To address this, we propose a data augmentation strategy that combines human-annotated and synthesized dialogues to enhance cultural understanding. Fine-tuning yields significant gains on TaiwanVQA while maintaining stable performance on other multimodal tasks. To further probe the models' cultural understanding, we conduct an open-ended question answering experiment. The results show a notable decline in cultural knowledge generation (approximately 10–20%), suggesting that challenges remain. TaiwanVQA offers a scalable framework for building culturally grounded AI models for low-resource cultures, promoting diversity and fairness in multimodal AI. Our dataset and code are publicly available on [Hugging Face](https://huggingface.co/datasets/hhhuang/TaiwanVQA) and [GitHub](https://github.com/hhhuang/TaiwanVQA).
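As a minimal sketch of how one might access the released data (assuming it loads with the standard `datasets` library; the split and field names below are not specified in the abstract and should be checked against the dataset card):

```python
from datasets import load_dataset

# Load the TaiwanVQA benchmark from the Hugging Face Hub.
dataset = load_dataset("hhhuang/TaiwanVQA")

# Print the available splits and their sizes; the abstract describes an
# official benchmark set (1,000 images / 2,000 questions) plus training data,
# but the exact split names are an assumption to verify here.
print(dataset)

# Inspect one example; field names (image, question, answer, etc.) may differ.
first_split = next(iter(dataset.values()))
print(first_split[0])
```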