GTO Wizard Benchmark Launches Public API for Poker AI Evaluation, Tests GPT-5.4, Claude, Gemini, and Grok

A new public benchmark framework for poker AI evaluation tests state-of-the-art LLMs against a superhuman agent, finding all models fall short while offering a standardized metric for AI reasoning research.

The benchmark, detailed in arXiv preprint 2603.23660, pits algorithms against GTO Wizard AI, which the authors report defeated Slumbot — the 2018 Annual Computer Poker Competition champion and previously the strongest publicly accessible HUNL benchmark — by $19.4 ± 4.1$ bb/100. The framework incorporates AIVAT, described as a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation, addressing what the authors characterize as a fundamental challenge in poker performance measurement. A benchmarking study conducted under zero-shot conditions tested several large language models including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and Grok 4; results showed all models performing below the benchmark baseline, with qualitative analysis identifying weaknesses in representation and reasoning over hidden states. The authors position the framework as a tool for evaluating planning and reasoning in multi-agent systems with partial observability.

THE FACTUM

GTO Wizard Benchmark Launches Public API for Poker AI Evaluation, Tests GPT-5.4, Claude, Gemini, and Grok

Sources (1)