AgentAtlas Maps Six-State Control Taxonomy to Expose LLM Agent Failure Modes
Taxonomy-driven evaluation reveals that prompt supervision accounts for most apparent agent gains, and it identifies systemic weaknesses relevant to multi-agent scaling.
AgentAtlas extends the 2024-2025 benchmark critiques by replacing single-outcome accuracy with a six-state control-decision taxonomy (Act/Ask/Refuse/Stop/Confirm/Recover) and a set of nine trajectory-failure labels, applied to 1,342 traces across eight models (arXiv:2605.20530). When the explicit label menus are removed, trajectory accuracy drops by 14 to 40 percentage points (pp) and settles at a floor of 0.54-0.62, regardless of model family. No single model leads on all three of control accuracy, trajectory diagnosis, and tool-context retention.
AgentAtlas: explicit supervision masks a 14-40 pp capability gap; multi-agent deployments will surface the same control-state and recovery failures at scale.
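To make the scoring idea concrete, here is a minimal Python sketch of how per-step control decisions could be scored against the six-state taxonomy, and how an accuracy gap translates into percentage points. It is an illustration under stated assumptions, not the paper's evaluation code: the names `ControlState`, `control_accuracy`, `confusions`, and `drop_in_pp` are hypothetical, and the toy labels and the 0.80/0.60 figures are invented for the example.

```python
from collections import Counter
from enum import Enum


class ControlState(Enum):
    """The six control decisions named in the taxonomy."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


def control_accuracy(gold, predicted):
    """Fraction of steps where the predicted state matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no labels to score")
    hits = sum(g is p for g, p in zip(gold, predicted))
    return hits / len(gold)


def confusions(gold, predicted):
    """Count (gold, predicted) pairs that disagree, to show which states get mixed up."""
    return Counter((g, p) for g, p in zip(gold, predicted) if g is not p)


def drop_in_pp(with_menu, without_menu):
    """Difference between two accuracies, expressed in percentage points."""
    return round((with_menu - without_menu) * 100, 1)


# Hypothetical toy labels, NOT data from the paper.
gold = [ControlState.ACT, ControlState.ASK, ControlState.REFUSE, ControlState.RECOVER]
pred = [ControlState.ACT, ControlState.ACT, ControlState.REFUSE, ControlState.STOP]

print(control_accuracy(gold, pred))  # 0.5
print(confusions(gold, pred))        # Ask->Act and Recover->Stop, once each
print(drop_in_pp(0.80, 0.60))        # 20.0 (illustrative accuracies only)
```

Reporting the confusion counts alongside a single accuracy figure matches the article's point: an aggregate score hides which control decisions (for example, acting when the agent should have asked) are actually failing.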
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2605.20530)
- [2] Related Source (https://arxiv.org/abs/2307.16789)
- [3] Related Source (https://arxiv.org/abs/2402.05120)