New Benchmark Finds LLM Agents Struggle as CFOs, Only 16% Survive Full Simulation
ArXiv paper introduces EnterpriseArena benchmark testing LLM agents on CFO-style resource allocation; only 16% of runs across 11 models survived the full 132-month simulation, with larger models showing no reliable advantage.
Researchers have introduced EnterpriseArena, described as the first benchmark designed to evaluate LLM agents on long-horizon enterprise resource allocation tasks modeled after CFO-style decision-making, according to a paper published on arXiv (https://arxiv.org/abs/2603.23638). The simulator spans 132 months and incorporates firm-level financial data, anonymized business documents, macroeconomic signals, industry data, and expert-validated operating rules. The environment is partially observable, requiring agents to balance information acquisition against conserving scarce resources through budgeted organizational tools. Experiments conducted across eleven advanced LLMs demonstrated that the benchmark poses significant challenges: only 16% of runs survived the full simulation horizon. The researchers further found that larger models did not reliably outperform smaller ones, undermining assumptions about scale as a proxy for capability in this domain. The authors conclude that long-horizon resource allocation under uncertainty represents a distinct and unresolved capability gap for current LLM agents, distinguishing it from shorter-horizon reactive decision tasks where LLMs have shown more consistent performance.
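The core tension the benchmark probes — spending a limited tool budget to observe hidden state versus conserving cash to survive — can be illustrated with a toy sketch. Everything below (the numbers, the `report_cost` tool, the revenue rule) is a hypothetical simplification for illustration, not the actual EnterpriseArena environment:

```python
import random

def simulate(months=132, starting_cash=100.0, report_cost=1.0,
             use_reports=True, seed=0):
    """Toy partially observable, budget-constrained enterprise loop.

    Each month demand is hidden; the agent may pay a tool budget
    (report_cost) to observe it exactly, or act on a fixed prior.
    Returns the number of months survived (== months if it lasts
    the full horizon). Illustrative only.
    """
    rng = random.Random(seed)
    cash = starting_cash
    for month in range(months):
        demand = rng.uniform(0.0, 10.0)       # hidden state
        if use_reports and cash > report_cost:
            cash -= report_cost               # spend budget to observe
            estimate = demand                 # exact observation after paying
        else:
            estimate = 5.0                    # prior guess, no observation
        spend = min(cash, estimate)           # allocate based on current belief
        revenue = 2.0 * min(spend, demand)    # spend beyond true demand is wasted
        cash += revenue - spend - 2.0         # fixed monthly overhead
        if cash <= 0:
            return month                      # firm fails this month
    return months                             # survived the full horizon
```

Even this caricature shows why long horizons are hard: a policy that looks fine for a few months (e.g., never buying reports and spending on a stale prior) compounds small misallocations until the firm fails, which is the failure mode the paper reports for most agent runs.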
AXIOM: This tells us AI still can't handle the messy, long-term decisions that keep a business alive, even when it's given more than a decade of simulated time to try. For regular people, it means AI assistants might help with quick tasks but won't be taking over important jobs or steering companies anytime soon.
Sources (1)
- [1] Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments (https://arxiv.org/abs/2603.23638)