MFCL Vision: Benchmarking Tool Use in Multimodal Large Language Models for Visual Reasoning Tasks
Abstract
As multimodal large language models become tool-using agents, the field still lacks a standardized way to measure how well they translate visual inputs into correct tool invocations. We introduce MFCL Vision, the first large-scale benchmark for vision-based function calling, comprising 250 expert-verified tasks across five image domains (Places, Events, Media, Sports, Shopping) and five query types (Locate, Temporal, Select, Identify, Quantify). Each task consists of (1) a textual user query, (2) an accompanying image, (3) a ground-truth answer obtained from the web, and (4) a human-produced reasoning trace for comparative error analysis. To constrain the task, we expose a single web-search tool to each model. To examine the robustness of multimodal LLMs' perception-to-tool-use pipeline, we introduce controlled visual perturbations, including crops, resizes, and color-channel removal. Our automatic grader computes exact-match scores on models' final answers, removing dependence on brittle and potentially biased LLM judges. We evaluate leading models and present a taxonomy of failure modes, including visual reasoning, assumption bias, keyword selection, and tool-avoidance errors. By releasing MFCL Vision's dataset, taxonomy, and diagnostics, we aim to accelerate progress toward versatile multimodal agents capable of intelligent tool use in complex visual contexts.
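To illustrate the kind of controlled visual perturbations described above, the sketch below applies a center crop, a resize, and single-channel removal to an input image. The function names, parameter values, and file name are illustrative assumptions, not the benchmark's released implementation.

```python
from PIL import Image
import numpy as np

def center_crop(img: Image.Image, fraction: float = 0.8) -> Image.Image:
    """Keep the central `fraction` of the image in each dimension (assumed setting)."""
    w, h = img.size
    cw, ch = int(w * fraction), int(h * fraction)
    left, top = (w - cw) // 2, (h - ch) // 2
    return img.crop((left, top, left + cw, top + ch))

def downscale(img: Image.Image, scale: float = 0.5) -> Image.Image:
    """Resize the image by a factor of `scale` (assumed setting)."""
    w, h = img.size
    return img.resize((max(1, int(w * scale)), max(1, int(h * scale))))

def drop_channel(img: Image.Image, channel: int = 0) -> Image.Image:
    """Zero out one RGB channel (0=R, 1=G, 2=B)."""
    arr = np.array(img.convert("RGB"))
    arr[:, :, channel] = 0
    return Image.fromarray(arr)

# Example: build the perturbed variants of a single benchmark image.
original = Image.open("task_042.jpg")  # hypothetical file name
variants = {
    "crop": center_crop(original),
    "resize": downscale(original),
    "no_red": drop_channel(original, channel=0),
}
```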
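The exact-match grading described above can be approximated by the following minimal sketch; the normalization rules (lowercasing, stripping punctuation, articles, and extra whitespace) are assumptions introduced for illustration and may differ from MFCL Vision's released grader.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation, articles, and extra whitespace (assumed normalization)."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Score 1 only if the normalized prediction equals the normalized ground truth."""
    return normalize(prediction) == normalize(ground_truth)

# Usage example with a hypothetical model answer and ground truth.
assert exact_match("The Eiffel Tower.", "eiffel tower")
```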