Frontier LLMs Split on 672 of 1,000 Real-User Fact Claims
LLM fact-check disagreement quantified at 67 percent on real claims, with supporting metrics from prior hallucination benchmarks.
Five frontier LLMs returned non-matching verdicts on 672 of 1,000 recent user-submitted claims under a four-level True/Mostly True/Misleading/False rubric. Lenz Research (https://lenz.io/research/llm-disagreement) recorded a Krippendorff’s ordinal α of 0.639 across the panel and documented two-bucket gaps on 343 claims. Every one of the 672 split claims drew at least one dissenting verdict, and on a subset of them no panel majority emerged at all. Even under the most charitable assumption about which verdict was correct, at least one model was wrong on 67 percent of items, at least two on 45 percent, and at least three on 13 percent. Verdict distributions also varied: some models clustered at the True/False poles while others populated the middle buckets. Earlier factual-consistency evaluations such as the 2021 TruthfulQA benchmark and the 2023 HaluEval study similarly quantified output variance on claims lacking training-corpus anchors, which suggests the pattern extends beyond the Lenz sample to other evaluation regimes.
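The per-claim figures above can be illustrated with a minimal Python sketch. It is not Lenz Research's code and rests on two assumptions: that the "most charitable" reading treats the most common verdict on each claim as correct, and that a "two-bucket gap" means the furthest-apart verdicts sit at least two rubric steps apart. The function names and the toy panel are hypothetical, and the toy data is not drawn from the study.

```python
from collections import Counter

# Rubric buckets in ordinal order, as named in the article.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {name: i for i, name in enumerate(BUCKETS)}


def min_errors(verdicts):
    """Fewest models that must be wrong on one claim.

    Assumption: the most common verdict is treated as correct,
    which is the most charitable reading for the panel.
    """
    counts = Counter(verdicts)
    return len(verdicts) - max(counts.values())


def bucket_gap(verdicts):
    """Distance in rubric steps between the two furthest-apart verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def summarize(panel):
    """Share of claims that are split, and that reach each error/gap level."""
    n = len(panel)
    errors = [min_errors(v) for v in panel]
    gaps = [bucket_gap(v) for v in panel]
    return {
        "split": sum(e >= 1 for e in errors) / n,
        "min_errors_ge_2": sum(e >= 2 for e in errors) / n,
        "min_errors_ge_3": sum(e >= 3 for e in errors) / n,
        "gap_ge_2": sum(g >= 2 for g in gaps) / n,
    }


# Hypothetical toy panel: one row per claim, one verdict per model.
TOY_PANEL = [
    ["True", "True", "True", "True", "True"],                  # unanimous
    ["True", "True", "Mostly True", "True", "True"],           # one adjacent dissent
    ["True", "Misleading", "Misleading", "False", "True"],     # no majority
    ["False", "False", "Misleading", "Mostly True", "False"],  # majority, two dissents
]

if __name__ == "__main__":
    print(summarize(TOY_PANEL))
```

With five models, a claim on which at least three verdicts must be wrong is one where no verdict collected three votes, so the "at least three" share also counts the claims with no panel majority.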
AXIOM: Persistent inter-model variance on non-benchmark claims indicates factuality remains unanchored at deployment scale.
Sources (2)
- [1] Primary Source (https://lenz.io/research/llm-disagreement)
- [2] Related Source (https://arxiv.org/abs/2305.14627)