Efficient Vision-Language Reasoning via Adaptive Token Pruning
Abstract
As vision-language models (VLMs) advance toward real-world deployment in domains such as robotics, autonomous systems, and assistive technologies, their computational and memory demands remain a persistent bottleneck. Existing architectures typically process all visual and textual tokens uniformly, regardless of each token's contribution to the final prediction, wasting computation and incurring latency that hinders scalability. In this work, we introduce Adaptive Token Pruning (ATP), a dynamic inference mechanism that learns to identify and retain only the most informative subset of multimodal tokens based on their contextual relevance. ATP analyzes cross-modal attention distributions at each transformer layer and estimates per-token importance scores from both inter- and intra-modal saliency. Tokens deemed redundant are pruned progressively, allowing the model to focus computation on semantically rich regions and phrases while maintaining alignment across modalities. Unlike static compression or distillation approaches, ATP adapts to each input instance and requires no retraining of the backbone or architectural redesign. We implement ATP as a lightweight gating module compatible with popular VLM backbones such as BLIP-2, LLaVA, and Flamingo. Empirical evaluations on VQAv2, GQA, and COCO Captioning show that ATP reduces inference FLOPs by up to 45\% and delivers a 1.8× latency speedup, with negligible (<1\%) loss in task accuracy. Qualitative analyses further show that pruned models preserve visual grounding and contextual reasoning fidelity, suggesting that token pruning can also serve as a lens into model interpretability. Beyond efficiency, we study the robustness of ATP-enhanced models under visual corruptions and linguistic perturbations. Adaptive pruning tends to suppress spurious correlations and hallucinated features, yielding improved stability across noise conditions. These findings suggest that resource-constrained inference and model reliability are not necessarily competing objectives: adaptive mechanisms can improve both simultaneously. Finally, we discuss how ATP can be integrated into deployment pipelines for multimodal edge computing, arguing for adaptive pruning as a general design principle for efficient, robust, and real-time VLM reasoning.
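As a concrete illustration of the scoring-and-pruning step described above, the sketch below keeps only the most attended-to tokens at a given layer. This is a minimal, hypothetical rendering of the idea rather than the paper's implementation: the function name, tensor shapes, and the fixed `keep_ratio` are assumptions, and ATP's actual importance scores combine inter- and intra-modal saliency in ways the abstract does not fully specify.

```python
# Minimal sketch of attention-based token pruning (hypothetical; not the
# paper's implementation). Importance here is the attention mass a token
# receives, a simple proxy for the cross-modal saliency ATP describes.
import torch

def prune_tokens(hidden: torch.Tensor, attn: torch.Tensor,
                 keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of tokens by attention importance.

    hidden: (B, N, D) token embeddings at the current layer
    attn:   (B, H, Q, N) attention weights in which these N tokens are keys
    """
    # Average attention over heads and queries -> one score per key token.
    importance = attn.mean(dim=(1, 2))                   # (B, N)
    n_keep = max(1, int(keep_ratio * hidden.size(1)))
    idx = importance.topk(n_keep, dim=-1).indices        # (B, n_keep)
    idx = idx.sort(dim=-1).values                        # preserve token order
    batch = torch.arange(hidden.size(0)).unsqueeze(-1)   # (B, 1) for indexing
    return hidden[batch, idx]                            # (B, n_keep, D)
```

Applied once per layer with a keep ratio close to 1, a routine like this yields the progressive schedule the abstract describes: the surviving token set shrinks geometrically with depth, concentrating computation on the tokens that downstream layers actually attend to.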