Delay-of-Gratification as a Multi-Agent Survival Micro-benchmark for Long-Horizon LLMs: Social Exposure, Personas, and Tool Use Budgets
Abstract
Large language models (LLMs) are increasingly deployed as multi-turn agents that must sustain goals, use tools, and adapt to other agents over extended interactions. However, existing research lacks auditable, multi-turn, multi-factorial experiments that quantify LLM behavior under explicit constraints, with time-resolved statistics that reveal how behavior unfolds over long horizons. To address this gap, we develop a multi-agent micro-benchmark inspired by the Stanford marshmallow experiment: ReAct agents operate minute-by-minute with a "raise a question" tool under a per-step budget, while we factorially manipulate social context (broadcast vs. isolated), personas (age, hedonic drive), and metacognitive policy (must vs. may follow instructions). We analyze outcomes with Kaplan-Meier (KM) survival curves and discrete-time hazard models over a long risk horizon. Across 19,200 agent trajectories in 64 cells (horizon T = 19), 99.9% of runs were valid. Behavior exhibits a sharp early "eat" impulse (initial eat = 0.125), a total eat rate of 0.241, and 75.9% of agents persist to the end. The waiting profile is summarized by a median time-to-eat of approximately 14.8 and an RMST of approximately 14.8. In a discrete-time hazard model, isolation reduces per-minute risk relative to broadcast (OR = 0.78, 95% CI [0.73, 0.83], p < .001), whereas a MUST-use self-questioning policy increases risk (OR = 1.42, [1.35, 1.50], p < .001). Hedonic and age personas strongly modulate risk: vs. crave, like (OR = 0.28), none (0.19), and neutral (0.03) reduce hazard; vs. adult, child increases hazard (OR = 66.3), and senior is elevated (OR = 7.55) (all p < .001). On average, agents ask approximately 7.12 questions and hit the per-step budget in approximately 6% of minutes; question-asking declines faster under broadcast than under isolation.
Further ablation experiments demonstrate that removing hedonic drive and/or persona age systematically increases survival and completion, narrows the broadcast/isolated gap, and leaves the must vs. may ordering intact (must is riskier); the combined ablation (no hedonic + no persona age) yields the highest completion (approaching 1.0) and distinct tool-usage dynamics, with higher initial questioning rates that gradually decrease over time. These results establish delay-of-gratification as a compact multi-turn interaction benchmark that captures social contagion and tool-use dynamics in LLM agents, offering a reproducible testbed and statistics for analyzing long-horizon, multi-agent behavior.
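To make the survival analysis concrete, the following is a minimal sketch (not the paper's code) of the discrete-time Kaplan-Meier product-limit estimator applied to agent trajectories over a horizon of T minutes. Each trajectory is assumed to be a pair (event_time, observed), where observed=False means the agent persisted to the end and is right-censored at T; the function name and data layout are illustrative assumptions.

```python
def km_survival(trajectories, horizon):
    """Kaplan-Meier estimate S(t) for t = 1..horizon.

    trajectories: list of (event_time, observed) pairs, where
    observed=True means the agent ate at event_time (the event), and
    observed=False means it was right-censored at event_time.
    """
    at_risk = len(trajectories)  # agents still waiting at t = 1
    surv, s = [], 1.0
    for t in range(1, horizon + 1):
        events = sum(1 for et, obs in trajectories if obs and et == t)
        censored = sum(1 for et, obs in trajectories if not obs and et == t)
        if at_risk > 0:
            # product-limit update: multiply by (1 - per-minute hazard)
            s *= 1.0 - events / at_risk
        surv.append(s)
        at_risk -= events + censored  # remove events and censored agents
    return surv

# toy example: three agents eat at minutes 2, 5, 5; two persist to T = 19
data = [(2, True), (5, True), (5, True), (19, False), (19, False)]
S = km_survival(data, 19)
# S[1] = 0.8 (after the minute-2 event), S[-1] = 0.4
```

The same person-period data layout (one row per agent per minute, with an event indicator) feeds the discrete-time hazard model: a logistic regression of the per-minute eat indicator on condition dummies yields the odds ratios reported in the abstract.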