AIRA_2 Addresses Three Structural Bottlenecks in AI Research Agents
The paper identifies three bottlenecks in existing AI research agents: synchronous single-GPU execution, which constrains sample throughput and limits the benefits of search; a generalization gap, in which validation-based selection degrades performance over extended search horizons; and the limited capability of fixed, single-turn LLM operators, which imposes a performance ceiling (arXiv:2603.26499).
AIRA_2 addresses these with an asynchronous multi-GPU worker pool that yields linear throughput gains, a Hidden Consistent Evaluation protocol that provides a reliable evaluation signal, and ReAct agents that dynamically scope their actions and debug interactively. It achieves a mean Percentile Rank of 71.8% at 24 hours on MLE-bench-30, surpassing the prior best of 69.9%, and reaches 76.0% at 72 hours (arXiv:2603.26499).
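To make the asynchronous worker-pool idea concrete, here is a minimal Python sketch. It is not AIRA_2's implementation: the names `evaluate`, `worker`, `run_pool`, and `run` are hypothetical, and a short `asyncio.sleep` stands in for training a candidate on a GPU. It only shows the general pattern in which each worker pulls the next candidate as soon as it is free, rather than waiting for a whole batch to finish.

```python
import asyncio
import random


async def evaluate(candidate_id, gpu_id, rng):
    # Hypothetical stand-in for training and scoring one candidate on one GPU.
    await asyncio.sleep(rng.uniform(0.001, 0.005))
    return candidate_id, gpu_id, rng.random()


async def worker(gpu_id, queue, results, rng):
    # Each worker takes a new candidate the moment it becomes idle.
    while True:
        candidate_id = await queue.get()
        try:
            results.append(await evaluate(candidate_id, gpu_id, rng))
        finally:
            queue.task_done()


async def run_pool(num_gpus, num_candidates, seed=0):
    rng = random.Random(seed)
    queue = asyncio.Queue()
    results = []
    for candidate_id in range(num_candidates):
        queue.put_nowait(candidate_id)
    workers = [
        asyncio.create_task(worker(gpu_id, queue, results, rng))
        for gpu_id in range(num_gpus)
    ]
    await queue.join()  # wait until every candidate has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


def run(num_gpus=4, num_candidates=20, seed=0):
    return asyncio.run(run_pool(num_gpus, num_candidates, seed))


if __name__ == "__main__":
    finished = run()
    print(f"evaluated {len(finished)} candidates")
```

In this toy setup no worker idles while slower evaluations finish, which is the property that lets throughput grow with the number of workers. Whether the gain is actually linear in a real system depends on factors the sketch does not model.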
Ablation studies in the paper confirm each component is necessary and determine that overfitting reported in prior work resulted from evaluation noise rather than data memorization (arXiv:2603.26499).
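The distinction between evaluation noise and memorization can be illustrated with a small simulation. This is not the paper's experiment: `selection_gap` is a hypothetical helper, and the Gaussian noise model is an assumption. Every candidate has the same true quality, so nothing can be memorized, yet picking the best noisy validation score still produces an apparent validation-to-test gap that grows with the number of candidates searched.

```python
import random
import statistics


def selection_gap(num_candidates, noise_std, trials=2000, seed=0):
    """Mean gap between the winner's validation score and its re-evaluated score.

    All candidates share a true quality of 0.0 (an assumption of this toy model),
    so any gap comes purely from selecting on noisy scores.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Noisy validation scores for candidates of identical true quality.
        validation = [rng.gauss(0.0, noise_std) for _ in range(num_candidates)]
        best_validation = max(validation)
        # Independent re-evaluation of the selected candidate.
        test_score = rng.gauss(0.0, noise_std)
        gaps.append(best_validation - test_score)
    return statistics.mean(gaps)


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(selection_gap(n, noise_std=1.0), 3))
```

With a single candidate the average gap is near zero. With many candidates the selected validation score is inflated by chance, which looks like overfitting even though no data was memorized.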
Sources (1)
- [1] AIRA_2: Overcoming Bottlenecks in AI Research Agents (https://arxiv.org/abs/2603.26499)