Demystifying and Enhancing the Efficiency of Interleaved Reasoning-Search LLM Agents
Tiannuo Yang · Zebin Yao · Bowen Jin · Lixiao Cui · Yusen Li · Gang Wang · Xiaoguang Liu · Willie Neiswanger
Abstract
Large Language Model (LLM)-based search agents solve complex tasks by interleaving reasoning and retrieval, but this paradigm introduces critical efficiency bottlenecks. We identify two key sources of inefficiency: (1) a non-monotonic tradeoff between retrieval accuracy and efficiency, where exact search incurs heavy overhead while overly coarse retrieval prolongs the reasoning process; and (2) cascading latency caused by system design flaws, including improper scheduling and retrieval-induced stalls, in which small retrieval delays are amplified into much larger end-to-end inference times. To address these issues, we propose \texttt{SearchAgent-X}, a high-efficiency inference framework built on high-recall approximate retrieval with two new techniques: priority-aware scheduling and non-stall retrieval. Experiments show that \texttt{SearchAgent-X} outperforms state-of-the-art baselines such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4$\times$ higher throughput and 5$\times$ lower latency without compromising generation quality. Code is available at \url{https://anonymous.4open.science/r/SearchAgent-X}.
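To make the two techniques concrete, the sketch below illustrates one plausible reading of them. It is not \texttt{SearchAgent-X}'s actual implementation: the function names (`priority_score`, `schedule`, `non_stall_search`), the priority weighting, and the early-termination criterion are all hypothetical stand-ins. The scheduling heuristic favors requests that are deeper into their reasoning-retrieval chain (and thus hold more reusable context) and that have waited longer; the retrieval loop is an "anytime" nearest-neighbor scan that returns its best-so-far neighbors as soon as the generator signals readiness, rather than stalling token generation on an exhaustive search.

```python
# Illustrative sketch only -- not SearchAgent-X's actual API.
# The priority heuristic and early-exit rule are hypothetical stand-ins
# for the paper's priority-aware scheduling and non-stall retrieval.
import heapq
import threading
import time


def priority_score(searches_done: int, arrival: float, now: float) -> float:
    """Hypothetical scheduling priority: prefer requests deeper in their
    reasoning-retrieval chain (more reusable context) and older arrivals.
    Lower scores are scheduled first."""
    return -(10.0 * searches_done + (now - arrival))


def schedule(waiting: list[dict]) -> list[dict]:
    """Order waiting requests by the priority heuristic above."""
    now = time.monotonic()
    return sorted(
        waiting,
        key=lambda r: priority_score(r["searches_done"], r["arrival"], now),
    )


def non_stall_search(candidates, distance_to_query, ready: threading.Event, k: int = 5):
    """Anytime approximate nearest-neighbor scan: refine a top-k set and,
    if the generator becomes ready mid-search, return best-so-far results
    instead of stalling generation until the search finishes."""
    top = []  # max-heap over distance via negation: farthest kept neighbor on top
    for doc_id, vec in candidates:
        d = distance_to_query(vec)
        heapq.heappush(top, (-d, doc_id))
        if len(top) > k:
            heapq.heappop(top)  # evict the farthest neighbor
        if ready.is_set():
            break  # generator is ready: stop early, keep partial results
    # Sort nearest-first (descending on negated distance = ascending distance).
    return [doc_id for _, doc_id in sorted(top, reverse=True)]
```

In this reading, the scheduler would re-rank the waiting queue at each serving step, and the retriever trades a small amount of recall for never blocking the generator; the real system's priority signal and termination rule may differ, so treat the above purely as a conceptual illustration.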