BigDocs: A Permissively-Licensed Dataset for Training Vision-Language Models on Document and Code Tasks
Abstract
Vision and language models that can accurately understand both images and text are crucial for deeper document understanding. These models can efficiently perform enterprise-level tasks such as processing receipts from screenshots, generating websites and business workflows from sketches, and extracting information from structured documents. These tasks often require generating long, structured outputs, an area where models trained on current datasets struggle. Additionally, many existing datasets are not license-permissive, limiting their use to non-commercial applications. To address these limitations, we present BigDocs, a high-quality, carefully curated dataset for training license-permissive Vision and Language Models (VLMs) capable of performing a wide variety of tasks. The dataset focuses on acquiring accurate image-text pairs across diverse tasks while adhering to accountability, responsibility, and transparency (ART) standards. Our preliminary experiments demonstrate that pre-training on BigDocs yields performance gains in document reasoning and in tasks requiring long structured outputs, such as screenshot-to-HTML, table-to-LaTeX, and image-to-SVG. We believe that VLMs trained on BigDocs can significantly enhance multimodal capabilities, benefiting broader research in multimodal document understanding.