PRL-Bench Maps LLM Limits on End-to-End Frontier Physics Tasks
PRL-Bench shows frontier LLMs scoring below 50 on physics research workflows drawn from recent PRL papers, exposing failure modes that prior knowledge-focused benchmarks do not capture.
PRL-Bench evaluates LLMs on replicating research workflows from 100 curated Physical Review Letters papers published since August 2025.
The benchmark spans astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics, and requires exploration-oriented problem formulation, long-horizon planning, and objective verification without wet-lab experiments (Miao et al., arXiv:2604.15411). It differs from GPQA, which tests graduate-level physics Q&A but does not evaluate autonomous research pipelines (Rein et al., arXiv:2311.12022).
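To make "objective verification" concrete, here is a minimal sketch, assuming a task asks the model to reproduce numerical checkpoints from a paper's computational workflow. The function names, checkpoint fields, tolerance policy, and 0-100 scaling below are illustrative assumptions, not details from the PRL-Bench paper.

```python
import math

# Hypothetical sketch: score a model's replication of a paper's
# computational workflow by checking numerical checkpoints against
# reference values. All names and the tolerance policy are assumptions,
# not PRL-Bench's actual harness.

def verify(model_value: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """A checkpoint passes if the model's value matches the paper's
    reference within a relative tolerance -- no wet lab required."""
    return math.isclose(model_value, reference, rel_tol=rel_tol)

def score_task(model_outputs: dict[str, float],
               references: dict[str, float]) -> float:
    """Fraction of checkpoints reproduced, scaled to 0-100, so a
    'best overall score below 50' would mean less than half the
    workflow was verifiably replicated."""
    passed = [verify(model_outputs.get(name, float("nan")), ref)
              for name, ref in references.items()]
    return 100.0 * sum(passed) / len(passed)

# Illustrative condensed-matter task with two verifiable checkpoints.
references = {"critical_temperature_K": 4.20, "bcs_gap_ratio": 3.53}
model_outputs = {"critical_temperature_K": 4.21, "bcs_gap_ratio": 3.10}
print(score_task(model_outputs, references))  # 50.0: one of two passes
```

The sketch's point is the last docstring clause: because tasks stay within theory and computation, a pass/fail decision needs only a reference value and a tolerance, with no laboratory step in the loop.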
Frontier models achieve best overall scores below 50, consistent with the limits seen in Sakana AI's automated discovery system, whose generated papers lacked verified novelty (Lu et al., arXiv:2408.06292). Less emphasized in the primary source's framing is that PRL-Bench concentrates on theory-computation subfields precisely because they permit verifiable end-to-end evaluation.
Taken together, the three papers indicate that current LLMs remain below the threshold for agentic science in domains that demand procedural rigor and iterative exploration.
AXIOM: PRL-Bench shows LLMs still lag on long-horizon physics exploration; scores below 50 mean substantial advances are needed before autonomous AI research becomes viable.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2604.15411)
- [2] GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)
- [3] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (https://arxiv.org/abs/2408.06292)