Technology

Saturday, June 27, 2026 at 09:00 PM

CORE-Bench v1.1 retains signal on efficiency and reliability after 85% accuracy saturation

Once a benchmark saturates, evaluation shifts from accuracy alone to construct validity, efficiency, reliability, and human uplift. CORE-Bench v1.1 supplies concrete instrumentation for these axes after agents maxed out the original suite. The case shows why accuracy-centric leaderboards lose resolution on frontier agents: when top scores cluster near the ceiling, accuracy no longer separates them.

AXIOM

80.0% accuracy

Leaderboards optimized for accuracy discard data on out-of-distribution generalization and collaboration uplift once saturation occurs. The case study supplies explicit metrics and an updated benchmark release that operationalize these dimensions without requiring new task creation.
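The case study's own metric definitions are not reproduced here. As a minimal sketch of how extra axes can be read off runs that already exist, the Python below assumes each task is attempted several times and each attempt logs a success flag and a dollar cost. The `Run` record, the agent and task names, the costs, and the three metric definitions (mean accuracy, share of tasks solved on every attempt, cost per successful attempt) are illustrative assumptions, not CORE-Bench v1.1's instrumentation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    agent: str       # hypothetical agent name
    task: str        # hypothetical task id
    success: bool    # whether this attempt solved the task
    cost_usd: float  # assumed cost of this attempt


def summarize(runs):
    """Return {agent: {"accuracy", "reliability", "cost_per_success"}}."""
    by_agent = defaultdict(lambda: defaultdict(list))
    for r in runs:
        by_agent[r.agent][r.task].append(r)

    report = {}
    for agent, tasks in by_agent.items():
        attempts = [r for rs in tasks.values() for r in rs]
        successes = sum(r.success for r in attempts)
        total_cost = sum(r.cost_usd for r in attempts)
        solved_every_time = sum(all(r.success for r in rs) for rs in tasks.values())
        report[agent] = {
            # Mean success rate over all attempts.
            "accuracy": successes / len(attempts),
            # Share of tasks solved on every repeated attempt (assumed definition).
            "reliability": solved_every_time / len(tasks),
            # Dollars spent per successful attempt; None if nothing succeeded.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return report


def make_runs(agent, outcomes, cost):
    """Expand {task: [success, ...]} into Run records at a flat per-attempt cost."""
    return [Run(agent, task, ok, cost) for task, oks in outcomes.items() for ok in oks]


# Two made-up agents with identical accuracy but different consistency and cost.
runs = (
    make_runs("agent_a", {"t1": [True, True], "t2": [True, True],
                          "t3": [False, False], "t4": [True, True]}, cost=1.5)
    + make_runs("agent_b", {"t1": [True, False], "t2": [True, False],
                            "t3": [True, True], "t4": [True, True]}, cost=0.75)
)

report = summarize(runs)
for agent, metrics in report.items():
    print(agent, metrics)
```

In this toy data both agents score 0.75 on accuracy, so an accuracy-only leaderboard ties them, while the reliability and cost columns still tell them apart. No new tasks are needed, only repeated attempts and cost logging on the existing ones.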

⚡ Prediction

Narayanan group: CORE-Bench OOD top-1 accuracy falls below 55% for all public agents within 9 months of v1.1 release.

Sources (3)

[1] Life After Benchmark Saturation: A Case Study of CORE-Bench (https://arxiv.org/abs/2606.26158)
[2] SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (https://arxiv.org/abs/2310.06770)
[3] MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation (https://arxiv.org/abs/2310.03302)