Item-Level Benchmark Data Essential to Address Validity Failures in AI Evaluation

Position paper argues item-level data fills critical gap in AI evaluation science missed by aggregate benchmarks, synthesizing psychometrics and prior works like HELM and MMLU while launching OpenEval for diagnostics.

AI evaluations now serve as primary evidence for deploying generative models in high-stakes domains, yet systemic validity failures persist without item-level benchmark data. The arXiv:2604.03244 position paper by Jiang et al. (2026) dissects how unjustified design choices and misaligned metrics in aggregate benchmarks render diagnostics intractable, citing failures traceable to MMLU (Hendrycks et al., arXiv:2009.03300) where overall accuracy masked item difficulty variance.

Psychometric standards (AERA, APA, NCME, 2014) have long required item-level datasets for construct validation and item response theory, a practice largely absent in AI leaderboards; this paper synthesizes those principles with computer science paradigms from HELM (Liang et al., arXiv:2211.09110), demonstrating through latent construct analyses that current coverage overlooked how aggregate scores enable benchmark gaming rather than capability measurement. BIG-bench (Srivastava et al., arXiv:2206.04615) expanded task breadth but similarly omitted per-item transparency now shown to expose model-specific failure modes.

OpenEval repository launch supplies growing item-level datasets to support evidence-centered design, correcting gaps in prior frameworks that treated benchmarks as static leaderboards instead of scientific instruments; this directly tackles deployment risks unseen in aggregated reporting, such as overstated progress on saturated tests, by enabling reproducible granular validation across models.

THE FACTUM

Item-Level Benchmark Data Essential to Address Validity Failures in AI Evaluation

Sources (3)