

Spotlight Poster

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma · Yuhang Zang · Liangyu Chen · Meiqi Chen · Yizhu Jiao · Xinze Li · Xinyuan Lu · Ziyu Liu · Yan Ma · Xiaoyi Dong · Pan Zhang · Liangming Pan · Yu-Gang Jiang · Jiaqi Wang · Yixin Cao · Aixin Sun

Wed 11 Dec 4:30 p.m. PST — 7:30 p.m. PST

Abstract:

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, including single-page document understanding (DU). However, their long-context DU abilities remain an open problem due to the lack of related benchmarks. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark constructed from 130 lengthy documents with an average of 49.4 pages and 20,971 tokens. It incorporates 1,062 expert-annotated questions and evaluates LVLMs' long-context DU abilities from diverse aspects: information identification (44.0% single-page questions), cross-page comprehension (33.2% cross-page questions), and hallucination severity (22.8% unanswerable questions). For comprehensive evaluation, these questions cover diverse evidence sources (i.e., text, image, chart, table, layout structure) and locations. Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models. Notably, the best-performing GPT-4o achieves only a 42.7% F1 score, while the second-best GPT-4V scores 31.4%. Furthermore, most LVLMs perform even worse than single-modality LLMs that are fed OCR-parsed, lossy documents. These results validate the necessity of future research toward better long-context LVLMs for this task.
