Poster
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
JaeYoo Park · Jin Young Choi · Jeonghyung Park · Bohyung Han
East Exhibit Hall A-C #3600
We present a novel framework for OCR-free document understanding based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to handle the various font sizes that appear within document images. To address the increased cost of feeding multi-scale visual inputs to pretrained MLLMs, we propose a hierarchical visual feature aggregation module designed to reduce the number of input tokens to the LLM. The module leverages a feature pyramid hierarchy with cross-attentive pooling, balancing the trade-off between information loss and efficiency without being affected by varying document image sizes. Additionally, we introduce a novel instruction tuning task that aims to improve the model's ability to read text by incorporating the positional information of text within images, and that is robust to the text truncation issue. Through comprehensive experiments, we demonstrate the efficacy of our framework in achieving outstanding document understanding performance on various tasks.
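To make the token-reduction idea concrete, the following is a minimal sketch of attention-based pooling and not the authors' implementation. It assumes that each consecutive group of fine-scale tokens is summarized by one coarse token, with the group mean acting as the query and the group members acting as keys and values. The function name `cross_attentive_pool`, the `group_size` parameter, and the toy input are hypothetical, and the learned projection matrices that a real module would use are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attentive_pool(tokens, group_size):
    """Reduce len(tokens) tokens to len(tokens) // group_size tokens.

    Assumption for illustration: the mean of each group is the query,
    and the group's own tokens are both keys and values. No learned
    projections are applied.
    """
    if len(tokens) % group_size != 0:
        raise ValueError("token count must be divisible by group_size")
    dim = len(tokens[0])
    pooled = []
    for start in range(0, len(tokens), group_size):
        group = tokens[start:start + group_size]
        # Query: average of the tokens in this group.
        query = [sum(t[i] for t in group) / group_size for i in range(dim)]
        # Scaled dot-product scores between the query and each token.
        scores = [
            sum(q * k for q, k in zip(query, t)) / math.sqrt(dim)
            for t in group
        ]
        weights = softmax(scores)
        # Output: attention-weighted sum of the group's tokens.
        pooled.append(
            [sum(w * t[i] for w, t in zip(weights, group)) for i in range(dim)]
        )
    return pooled


# Hypothetical toy input: 8 tokens of dimension 2, pooled 4-to-1.
fine_tokens = [[float(i), 1.0] for i in range(8)]
coarse_tokens = cross_attentive_pool(fine_tokens, group_size=4)
print(len(fine_tokens), "->", len(coarse_tokens))  # prints: 8 -> 2
```

Applying such a step repeatedly from the finest scale upward would shrink the token count at each level of a pyramid, which is the kind of cost reduction the abstract describes. Unlike plain average pooling, the attention weights let tokens that carry more signal contribute more to the summary.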