Anthropic Internal Metrics Track AI Task Completion from 4 Minutes to 12 Hours in Two Years
Anthropic logs show AI task horizons lengthening at an accelerating rate, offering direct evidence of progress toward autonomous model improvement.
Anthropic data document AI systems progressing from completing 4-minute software tasks in March 2024 with Claude Opus 3 to 12-hour tasks by late 2026 with Claude Opus 4.6 (https://www.anthropic.com/institute/recursive-self-improvement). Task length now doubles roughly every four months, up from an earlier cadence of one doubling every seven months. SWE-bench scores moved from low single digits to saturation in two years, while CORE-Bench replication success rose from 20% to full saturation in fifteen months. METR evaluations placed Claude Mythos Preview at a sustained 16 hours, the upper limit of what the evaluation can measure.
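As a rough check on the reported figures, the number of doublings separating a 4-minute task from a 12-hour task can be computed directly. This is a minimal sketch that assumes a clean exponential trend; the function names and the `elapsed_months` parameter are illustrative and do not come from the Anthropic source.

```python
import math


def doublings_between(start_minutes: float, end_minutes: float) -> float:
    """Number of doublings needed to grow from start_minutes to end_minutes."""
    return math.log2(end_minutes / start_minutes)


def implied_doubling_months(start_minutes: float, end_minutes: float,
                            elapsed_months: float) -> float:
    """Average months per doubling if the growth took elapsed_months (assumed input)."""
    return elapsed_months / doublings_between(start_minutes, end_minutes)


# 4 minutes -> 12 hours (720 minutes) is a 180x increase.
print(round(doublings_between(4, 12 * 60), 2))  # 7.49
```

Passing the elapsed time in as a parameter keeps the sketch neutral about which start and end dates to use. The resulting average blends the earlier seven-month cadence with the later four-month one.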
The internal productivity figures (engineers shipping 8x as much code per quarter as in the 2021-2025 baseline) quantify the shift from chatbots that generate snippets to autonomous agents that take on delegated multi-hour workflows. They bear directly on whether scaling produces measurable gains in autonomous capability rather than isolated benchmark spikes. The reported trajectory links external benchmark saturation to internal development acceleration without requiring fully autonomous model training.
Drawing on previously unreported Anthropic logs, the coverage supplies the first primary-source quantification of precursors to recursive self-improvement, a metric that mainstream reporting has passed over in favor of public benchmark summaries. If the doubling continues, week-long tasks come within range by 2027, tightening the timeline for putting control mechanisms in place before successor-model loops close.
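The extrapolation to week-long tasks can be sketched as a simple doubling calculation. The example assumes the four-month doubling time holds unchanged from a 12-hour starting point. Because the source does not define "week-long", it shows two readings: a 40-hour work week and a 168-hour calendar week. The function name and both targets are illustrative assumptions.

```python
import math


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months until the task horizon reaches target_hours under steady doubling."""
    return doubling_months * math.log2(target_hours / current_hours)


# Two assumed readings of "week-long"; neither is specified in the source.
targets = {"40-hour work week": 40, "168-hour calendar week": 168}

for label, hours in targets.items():
    months = months_to_reach(current_hours=12, target_hours=hours, doubling_months=4)
    print(f"{label}: {months:.1f} months")
# 40-hour work week: 6.9 months
# 168-hour calendar week: 15.2 months
```

The two readings differ by roughly eight months, so the projected date depends heavily on which definition of "week-long" is intended and on when the 12-hour mark was actually reached.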
AXIOM: Internal doubling rates indicate autonomous capability gains are already measurable and accelerating beyond public benchmarks.
Sources (2)
- [1] Primary Source (https://www.anthropic.com/institute/recursive-self-improvement)
- [2] Related Source (https://metr.org/blog/2025-task-duration/)