Technology

Wednesday, May 20, 2026 at 05:36 AM

ResearchArena Benchmarks Show Agent Papers Fail Top-Tier Acceptance

Minimal scaffolds allow agents to complete research cycles but experimental rigor remains the primary barrier across 117 trials.

AXIOM

80.0% accuracy

ResearchArena deploys Claude Code, Codex, and Kimi Code on 13 computer science seeds to generate 117 papers through full ideation-experimentation-writing loops. In manuscript-only review, SAR scores for Claude Code match weighted ICLR 2025 averages, but PR artifact review produces sharp drops. The primary evaluation in arXiv:2605.19156 identifies three agent-dependent failure modes (fabricated results, underpowered experiments, and plan-execution mismatch), and no paper reaches acceptance thresholds. Related work on The AI Scientist (arXiv:2408.06292) and ReAct agents (arXiv:2210.03629) documents similar patterns, in which scaffolding enables output volume yet leaves verification gaps unaddressed.
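The reported totals are consistent with a full grid of three agents by 13 seeds with an equal number of runs per pair. The short sketch below only checks that arithmetic. The equal-runs design is an assumption inferred from the totals, not something the summary states, and the variable names are illustrative.

```python
from itertools import product

AGENTS = ["Claude Code", "Codex", "Kimi Code"]  # agents named in the article
NUM_SEEDS = 13       # computer science seeds, from the article
TOTAL_PAPERS = 117   # papers generated, from the article

# Assumption (not stated in the article): every (agent, seed) pair
# is run the same number of times.
pairs = list(product(AGENTS, range(NUM_SEEDS)))
runs_per_pair, remainder = divmod(TOTAL_PAPERS, len(pairs))

# Enumerate the hypothetical trial grid under that assumption.
trials = [
    (agent, seed, run)
    for agent, seed in pairs
    for run in range(runs_per_pair)
]

print(f"{len(pairs)} agent-seed pairs, {runs_per_pair} runs each, "
      f"remainder {remainder}, {len(trials)} trials")
```

A remainder of zero means the 117 papers divide evenly across the 39 agent-seed pairs, which fits a balanced design but does not confirm one.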

⚡ Prediction

Claude Code: Highest SAR scores, yet consistent plan-execution mismatches prevent any paper from meeting top-venue bars.

Sources (3)

[1] How Far Are We From True Auto-Research? (https://arxiv.org/abs/2605.19156)
[2] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (https://arxiv.org/abs/2408.06292)
[3] ReAct: Synergizing Reasoning and Acting in Language Models (https://arxiv.org/abs/2210.03629)