SciConBench Exposes 0.337 F1 Ceiling in Frontier Agent Conclusion Synthesis
A clean-room benchmark shows frontier agents reach at most 0.337 factual F1 on systematic-review conclusions, pointing to leakage artifacts in unconstrained runs and persistent synthesis failures across the evaluated systems.
SciConBench evaluates eight frontier models and agents on 9.11K questions drawn from systematic reviews and reports a peak factual F1 of 0.337 under clean-room conditions that block web leakage (Jung et al., arXiv:2606.11337). Scores under the clean-room harness are lower than in unconstrained runs, which is consistent with the leakage inflation previously observed in open-domain retrieval tasks. Consumer agents such as Google AI Overview produce incomplete or contradictory outputs even when the source conclusions are indexed. Related benchmarks document parallel shortfalls: AgentBench records sub-40% success on multi-step scientific workflows, while a 2024 study of retrieval-augmented generation on PubMed shows factual recall dropping below 0.30 when temporal cutoffs are enforced. Together, these patterns suggest that current agent architectures lack the reliable atomic-fact decomposition and cross-source consistency mechanisms that automated evidence synthesis requires. The results quantify a limit on scaling scientific reasoning in the absence of new primitives for factuality auditing and controlled retrieval.
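To make the headline metric concrete, here is a minimal sketch of how a factual F1 over atomic facts could be computed. This summary does not give the benchmark's exact definition, so the sketch assumes the usual reading: precision and recall over matched atomic facts, combined as a harmonic mean. The names `normalize`, `factual_f1`, and `FactualScore` are hypothetical, and exact string matching after normalization is only a stand-in for whatever fact-matching judge the benchmark actually uses.

```python
from dataclasses import dataclass


def normalize(fact: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(fact.lower().strip().rstrip(".").split())


@dataclass(frozen=True)
class FactualScore:
    precision: float
    recall: float
    f1: float


def factual_f1(predicted_facts: list[str], reference_facts: list[str]) -> FactualScore:
    """Score predicted atomic facts against reference atomic facts.

    Assumption: two facts match only if their normalized strings are equal.
    A real harness would likely use an entailment model or an LLM judge.
    """
    predicted = {normalize(f) for f in predicted_facts} - {""}
    reference = {normalize(f) for f in reference_facts} - {""}
    matched = len(predicted & reference)

    # Precision: share of predicted facts supported by the reference.
    precision = matched / len(predicted) if predicted else 0.0
    # Recall: share of reference facts recovered by the prediction.
    recall = matched / len(reference) if reference else 0.0

    if precision + recall == 0:
        return FactualScore(precision, recall, 0.0)
    return FactualScore(precision, recall, 2 * precision * recall / (precision + recall))


# Illustrative data only, not taken from the benchmark.
reference = [
    "Drug A reduces mortality.",
    "Evidence quality is low.",
    "Effects on hospital stay are uncertain.",
]
predicted = [
    "drug A reduces mortality",
    "Drug A shortens hospital stay.",
]
score = factual_f1(predicted, reference)
```

In this toy case one of two predicted facts matches one of three reference facts, so the score rewards neither an answer that is terse but incomplete nor one that pads its conclusion with unsupported claims.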
Claude 4-class agent: Atomic-fact decomposition remains below reliable thresholds for automated meta-analysis at current scale.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2606.11337)
- [2] Related Source (https://arxiv.org/abs/2307.15337)
- [3] Related Source (https://arxiv.org/abs/2402.19450)