PREMISE: Scalable and Strategic Prompt Optimization for Efficient Mathematical Reasoning in Large Models
Abstract
Large Reasoning Models (LRMs) such as Claude 3.7 Sonnet and OpenAI o1 achieve strong performance on mathematical tasks via long Chain-of-Thought (CoT) reasoning, but they often generate unnecessarily verbose reasoning traces. This inflates token usage and cost, limiting deployment in latency-sensitive or API-constrained settings. To address this issue, we present \textbf{PREMISE} (\textit{PRompt-based Efficient Mathematical Inference with Strategic Evaluation}), an optimization framework designed specifically for black-box commercial LRMs. PREMISE reduces reasoning overhead without modifying model weights or issuing multiple queries per problem. It combines trace-level diagnostics with gradient-based prompt optimization to minimize redundant computation while preserving answer accuracy. Across GSM8K, SVAMP, and Math500, PREMISE matches or exceeds baseline accuracy while reducing reasoning tokens by up to \textbf{87.5\%} and cutting dollar cost by \textbf{69--82\%}.
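To make the objective concrete, the following is a minimal sketch, not the paper's implementation, of prompt selection for a black-box model under the stated goal: preserve accuracy first, then minimize reasoning tokens. All names (`toy_model`, `select_prompt`, the candidate prompts) are illustrative assumptions, and the whitespace tokenizer stands in for a provider tokenizer.

```python
def count_tokens(text):
    # Crude whitespace tokenizer standing in for the provider's tokenizer.
    return len(text.split())

def evaluate(prompt, model, dataset):
    """Query the black-box model once per example; return (accuracy, avg trace tokens)."""
    correct, tokens = 0, 0
    for question, answer in dataset:
        trace, prediction = model(prompt, question)
        correct += (prediction == answer)
        tokens += count_tokens(trace)
    n = len(dataset)
    return correct / n, tokens / n

def select_prompt(candidates, model, dataset):
    """Prefer higher accuracy; break ties by fewer reasoning tokens."""
    scored = [(evaluate(p, model, dataset), p) for p in candidates]
    # Sort key (-accuracy, avg_tokens): accuracy is preserved, verbosity minimized.
    scored.sort(key=lambda item: (-item[0][0], item[0][1]))
    return scored[0][1]

if __name__ == "__main__":
    # Toy stand-in for a commercial LRM: a concise prompt yields a shorter trace.
    def toy_model(prompt, question):
        trace = "step " * (3 if "concise" in prompt else 20)
        return trace, eval(question)  # toy arithmetic answers only

    data = [("2+2", 4), ("3*3", 9)]
    best = select_prompt(
        ["Think step by step.", "Be concise; think step by step."],
        toy_model, data)
    print(best)
```

The accuracy-then-tokens ordering mirrors the abstract's constraint that token reduction must not trade away correctness; a real system would replace the exhaustive scoring loop with the paper's gradient-based prompt optimization.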