Skip to yearly menu bar Skip to main content

Workshop: Instruction Tuning and Instruction Following

Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Yanzhe Zhang · Ruiyi Zhang · Jiuxiang Gu · Yufan Zhou · Nedim Lipka · Diyi Yang · Tong Sun

Keywords: [ Instruction Finetuning ] [ large language model ] [ multimodal ]


Instruction tuning enhances the capability of Large Language Models (LLMs) to interact with humans. Furthermore, recent instruction-following datasets include images as visual input, collecting responses for image-based instructions. However, current visual instruction-tuned models cannot comprehend textual details within images well. This work enhances the current visual instruction tuning pipeline with text-rich images (e.g., movie posters, book covers, etc.). Specifically, we first used publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset. Furthermore, we prompt text-only GPT-4 with recognized text and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images. By combining our collected data with previous multimodal instruction-following data, our model, LLaVAR, substantially improves the capability of the LLaVA model on text-based VQA datasets (up to 20% accuracy improvement). The GPT-4-based instruction-following evaluation also demonstrates the improvement of our model on both natural images and text-rich images. Through qualitative analysis, LLaVAR shows promising interaction skills (e.g., reasoning, writing, and elaboration) with humans based on the latest real-world online content that combines text and images. We make our code/data/models publicly available.

Chat is not available.