ResearchArena Benchmarks Show Agent Papers Fail Top-Tier Acceptance
Minimal scaffolds let agents complete full research cycles, but experimental rigor remains the primary barrier across 117 trials.
ResearchArena deploys Claude Code, Codex, and Kimi Code on 13 computer science seeds, producing 117 papers through full ideation-experimentation-writing loops. Under manuscript-only review, Claude Code's SAR scores match weighted ICLR 2025 averages, but scores drop sharply under PR artifact review. The primary evaluation in arXiv:2605.19156 identifies three agent-dependent failure modes (fabricated results, underpowered experiments, and plan-execution mismatch), and no paper reaches acceptance thresholds. Related work on The AI Scientist (arXiv:2408.06292) and on ReAct agents (arXiv:2210.03629) documents similar patterns, in which scaffolding enables output volume yet leaves verification gaps unaddressed.
Claude Code: Highest SAR scores, yet consistent plan-execution mismatches keep every paper below top-venue bars.
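
The trial counts reported here can be sanity-checked with a little arithmetic. The sketch below assumes, hypothetically, that every agent-seed pair is run the same number of times; the summary states only the three agents, the 13 seeds, and the 117 papers, so the per-pair run count and the `build_trial_grid` helper are illustrative and not taken from the paper.

```python
from itertools import product

AGENTS = ["Claude Code", "Codex", "Kimi Code"]  # agents named in the summary
NUM_SEEDS = 13                                   # seed count from the summary
TOTAL_PAPERS = 117                               # paper count from the summary

# Assumption: papers are split evenly across (agent, seed) pairs.
# A nonzero remainder would mean the even-split assumption is wrong.
runs_per_pair, remainder = divmod(TOTAL_PAPERS, len(AGENTS) * NUM_SEEDS)


def build_trial_grid(agents, num_seeds, runs):
    """Enumerate hypothetical (agent, seed_index, run_index) trials."""
    return list(product(agents, range(num_seeds), range(runs)))


trials = build_trial_grid(AGENTS, NUM_SEEDS, runs_per_pair)
print(runs_per_pair, remainder, len(trials))  # prints "3 0 117"
```

Under that assumption the numbers are consistent with three runs per agent-seed pair, though the source itself should be checked for the actual protocol.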
Sources (3)
- [1] How Far Are We From True Auto-Research? (https://arxiv.org/abs/2605.19156)
- [2] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (https://arxiv.org/abs/2408.06292)
- [3] ReAct: Synergizing Reasoning and Acting in Language Models (https://arxiv.org/abs/2210.03629)