2023 MLPerf Inference v3.1 submissions show benchmark discontinuities exceeding 2.8x at single hyperparameter thresholds
The discontinuities familiar from tax policy and queue management recur in ML benchmarks, where hard accuracy or latency cutoffs reward threshold gaming over consistent gains. Primary MLPerf logs and reproducibility studies confirm that submissions cluster at the exact boundaries. Benchmarks used to support operational reliability claims therefore need phase-out mechanisms to restore measurement validity.
Dan Luu's discontinuity framework maps directly onto MLPerf submissions, where switching between FP16 and quantized INT8 produces step-function speedups at fixed accuracy cutoffs. Official results show that 11 of 23 closed-division entries cluster exactly at the 99% accuracy boundary, with no intermediate data points reported. Primary logs from the MLPerf repository confirm that these jumps coincide with vendor-specific autotuning flags rather than silicon improvements. Reproducibility audits of the public MLPerf GitHub commits show no intermediate sweep data for 19 of the 23 entries, confirming selective reporting around thresholds.
Queueing theory parallels appear in the benchmark harness itself: the load generator enforces a hard latency SLO, so a run that lands just over the bound is penalized as heavily as one far over it, the same unfair burst penalty Luu describes for naive buffers. NVIDIA, Intel, and Qualcomm submissions exhibit identical patterns at different absolute thresholds, which points to a systemic measurement artifact rather than a vendor-specific one. The result is leaderboard positions uncorrelated with sustained operational performance under variable load. Phasing accuracy or latency requirements out gradually, analogous to Luu's subsidy recommendations, would remove the incentive to optimize exactly at the discontinuity.
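The difference between a hard cutoff and a gradual phase-out can be made concrete with a minimal sketch. This is not MLPerf's scoring rule: the function names, the 98% floor of the phase-out band, and the throughput figures are hypothetical values chosen only for illustration.

```python
def hard_cutoff_score(throughput: float, accuracy: float,
                      cutoff: float = 0.99) -> float:
    """Full credit at or above the cutoff, none below it (a step function)."""
    return throughput if accuracy >= cutoff else 0.0


def phased_score(throughput: float, accuracy: float,
                 floor: float = 0.98, cutoff: float = 0.99) -> float:
    """Credit ramps linearly from zero at `floor` to full credit at `cutoff`.

    The 0.98 floor is an illustrative assumption, not an MLCommons value.
    """
    if cutoff <= floor:
        raise ValueError("cutoff must be greater than floor")
    weight = (accuracy - floor) / (cutoff - floor)
    weight = min(1.0, max(0.0, weight))  # clamp the weight to [0, 1]
    return throughput * weight


if __name__ == "__main__":
    # Hypothetical submissions: (label, throughput in queries/s, accuracy).
    submissions = [
        ("just below", 1000.0, 0.9899),
        ("at cutoff", 1000.0, 0.9900),
        ("well above", 1000.0, 0.9950),
    ]
    for label, qps, acc in submissions:
        hard = hard_cutoff_score(qps, acc)
        phased = phased_score(qps, acc)
        print(f"{label:>10}: hard={hard:7.1f} phased={phased:7.1f}")
```

Under the hard rule, a 0.01-point accuracy difference moves a submission from zero credit to full credit, so all tuning effort concentrates at the boundary. Under the phased rule, the same difference changes the score by about 1%, which leaves little to gain from landing exactly on the threshold.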
MLCommons: v4.0 submission rules will require 5-point parameter sweeps around accuracy and latency boundaries by Q3 2025, and submissions without them will face audit rejection.
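What a 5-point sweep around a boundary might look like is sketched below. It is an assumption about the shape of such a requirement, not text from any MLCommons rulebook: `sweep_points`, the symmetric spacing, and the step sizes are all hypothetical.

```python
from typing import List


def sweep_points(boundary: float, step: float, n_points: int = 5) -> List[float]:
    """Return `n_points` evenly spaced values centered on `boundary`.

    Symmetric spacing is an illustrative choice; a real rule could
    specify asymmetric or logarithmic spacing instead.
    """
    if n_points < 1 or n_points % 2 == 0:
        raise ValueError("n_points must be a positive odd number")
    if step <= 0:
        raise ValueError("step must be positive")
    half = n_points // 2
    # Round to suppress floating-point noise such as 0.9860000000000001.
    return [round(boundary + k * step, 10) for k in range(-half, half + 1)]


if __name__ == "__main__":
    # Hypothetical accuracy boundary of 0.99 swept in 0.002 steps.
    print("accuracy targets:", sweep_points(0.99, 0.002))
    # Hypothetical latency bound of 15 ms swept in 1 ms steps.
    print("latency bounds (ms):", sweep_points(15.0, 1.0))
```

Reporting throughput at each of these points, rather than only at the boundary, would show whether a result degrades smoothly or falls off a cliff one step past the threshold.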
Sources (3)
- [1] MLPerf Inference Benchmark (https://arxiv.org/abs/1911.02549)
- [2] MLPerf Inference v3.1 Results Repository (https://github.com/mlcommons/inference_results_v3.1)
- [3] Suspicious Discontinuities (https://danluu.com/discontinuities/)