PRL-Bench Maps LLM Limits on End-to-End Frontier Physics Tasks
PRL-Bench shows frontier LLMs scoring below 50 on physics research workflows drawn from recent PRL papers, exposing failure modes that prior knowledge-focused benchmarks do not capture.
PRL-Bench evaluates LLMs on replicating research workflows from 100 curated Physical Review Letters papers published since August 2025.
The benchmark spans astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics, and requires exploration-oriented problem formulation, long-horizon planning, and objective verification without wet-lab experiments (Miao et al., arXiv:2604.15411). It differs from GPQA, which tests graduate-level physics Q&A but does not evaluate autonomous research pipelines (Rein et al., arXiv:2311.12022).
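To make "objective verification" concrete, here is a minimal sketch, assuming a task asks the model to reproduce numerical checkpoints from a paper's computational workflow. The function names, checkpoint fields, tolerance policy, and 0-100 scaling below are illustrative assumptions, not details from the PRL-Bench paper.

```python
import math

# Hypothetical sketch: score a model's replication of a paper's
# computational workflow by checking numerical checkpoints against
# reference values. All names and the tolerance policy are assumptions,
# not PRL-Bench's actual harness.

def verify(model_value: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """A checkpoint passes if the model's value matches the paper's
    reference within a relative tolerance -- no wet lab required."""
    return math.isclose(model_value, reference, rel_tol=rel_tol)

def score_task(model_outputs: dict[str, float],
               references: dict[str, float]) -> float:
    """Fraction of checkpoints reproduced, scaled to 0-100, so a
    'best overall score below 50' would mean less than half the
    workflow was verifiably replicated."""
    passed = [verify(model_outputs.get(name, float("nan")), ref)
              for name, ref in references.items()]
    return 100.0 * sum(passed) / len(passed)

# Illustrative condensed-matter task with two verifiable checkpoints.
references = {"critical_temperature_K": 4.20, "bcs_gap_ratio": 3.53}
model_outputs = {"critical_temperature_K": 4.21, "bcs_gap_ratio": 3.10}
print(score_task(model_outputs, references))  # 50.0: one of two passes
```

The sketch's point is the last docstring clause: because tasks stay within theory and computation, a pass/fail decision needs only a reference value and a tolerance, with no laboratory step in the loop.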
Frontier models achieve best overall scores below 50, consistent with the limits seen in Sakana AI's automated discovery system, whose generated papers lacked verified novelty (Lu et al., arXiv:2408.06292). Less emphasized in the primary source's framing is that PRL-Bench concentrates on theory-computation subfields precisely because they permit verifiable end-to-end evaluation.
Taken together, the three papers indicate that current LLMs remain below the threshold for agentic science in domains that demand procedural rigor and iterative exploration.
AXIOM: PRL-Bench shows LLMs still lag on long-horizon physics exploration; scores below 50 mean substantial advances are needed before autonomous AI research becomes viable.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2604.15411)
- [2] GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)
- [3] The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (https://arxiv.org/abs/2408.06292)