AgingBench Tracks Reliability Decay in Persistent AI Agents
Longitudinal benchmark demonstrates agent aging requires mechanism-level diagnosis beyond day-one evaluation.
AgingBench evaluates agent reliability over 8 to 200 sessions in 7 scenarios, using 14 models and temporal dependency graphs, and shows that the form of degradation varies by memory pipeline stage. The arXiv paper (https://arxiv.org/abs/2605.26302) identifies four distinct mechanisms: compression aging, interference aging, revision aging, and maintenance aging. It reports that factual precision decays even when behavioral tests remain clean, and that derived-state tracking collapses within single models. Paired counterfactual probes isolate failures at the write, retrieval, and utilization stages. Results across runner-controlled and autonomous agents indicate that repairs must target specific stages rather than base model weights alone. Related work on long-context memory systems supports the point that compressing interaction history alters an agent's effective state independently of its frozen weights.
AgingBench: Stage-targeted repairs outperform model retraining for sustained factual precision in memory pipelines.
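To make the idea of stage-level attribution concrete, here is a minimal Python sketch of a paired counterfactual probe. It is not the paper's implementation: the names `Probe`, `Stage`, `diagnose`, `toy_retrieve`, and `toy_answer` are hypothetical, the memory store is a plain list of strings, and exact string matching stands in for whatever scoring AgingBench actually uses. The pairing works as follows: each question is asked once with the agent's own retrieved context and once with the gold fact supplied directly, and the difference between the two runs points to the stage that failed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Stage(Enum):
    OK = "ok"
    WRITE = "write"              # fact never made it into memory (or was compressed away)
    RETRIEVAL = "retrieval"      # fact is stored but was not fetched
    UTILIZATION = "utilization"  # fact was available but the model did not use it


@dataclass(frozen=True)
class Probe:
    question: str
    gold_fact: str    # the memory entry that should exist (hypothetical format)
    gold_answer: str


Retrieve = Callable[[str], Sequence[str]]
Answer = Callable[[str, Sequence[str]], str]


def diagnose(probe: Probe, store: Sequence[str],
             retrieve: Retrieve, answer: Answer) -> Stage:
    """Attribute a failed probe to one pipeline stage (illustrative only)."""
    retrieved = list(retrieve(probe.question))

    # Factual run: the agent answers from whatever it retrieved itself.
    if answer(probe.question, retrieved) == probe.gold_answer:
        return Stage.OK

    # Counterfactual run: same question, but the gold fact is handed over.
    # If the answer is still wrong, the problem is downstream of memory.
    if answer(probe.question, [probe.gold_fact]) != probe.gold_answer:
        return Stage.UTILIZATION

    # The model can use the fact, so the failure is upstream.
    if probe.gold_fact not in store:
        return Stage.WRITE
    if probe.gold_fact not in retrieved:
        return Stage.RETRIEVAL
    # Fact was retrieved, yet lost among the other retrieved entries.
    return Stage.UTILIZATION


# Toy setup (assumption): "deadline=friday" was dropped by compression,
# and the retriever only ever returns the most recent entry.
store = ["budget=40k", "owner=dana"]


def toy_retrieve(question: str) -> Sequence[str]:
    return store[-1:]


def toy_answer(question: str, context: Sequence[str]) -> str:
    # Toy "model": reads the value from a "key=value" line matching the question.
    for line in context:
        key, _, value = line.partition("=")
        if key == question:
            return value
    return "unknown"


probes = [
    Probe("owner", "owner=dana", "dana"),
    Probe("budget", "budget=40k", "40k"),
    Probe("deadline", "deadline=friday", "friday"),
]

for p in probes:
    print(p.question, diagnose(p, store, toy_retrieve, toy_answer).value)
# owner ok
# budget retrieval
# deadline write
```

In this toy run, the three probes receive three different diagnoses even though the underlying "model" never changes, which mirrors the article's point that a repair aimed at the wrong stage (or at the weights) would leave the other failures in place.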
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2605.26302)
- [2] Related Source (https://arxiv.org/abs/2310.08560)
- [3] Related Source (https://arxiv.org/abs/2403.05530)