AI Weather Models Match Scores But Break Physics in Storms, Preprint Warns

A preprint case study of a single cyclone finds that Bris, Norway's AI weather model, posts competitive RMSE scores but disrupts atmospheric physical balances through fine-scale noise, unlike a physics-based NWP model. The analysis is limited to one event, but it raises reliability concerns for extremes.

A new preprint (not yet peer-reviewed) on arXiv (2604.01454) evaluates whether AI-based weather models can preserve essential physical balances in the atmosphere. Researchers from Met Norway examined their stretched-grid deep-learning weather prediction model called Bris during the severe extratropical cyclone Poly that struck the Netherlands on 5 July 2023.

The methodology was a targeted case study: the authors compared the deterministic version of Bris against the control run of the operational MetCoOp Ensemble Prediction System (MEPS), a traditional numerical weather prediction (NWP) model. Rather than relying solely on standard metrics, the analysis focused on deviations from key atmospheric balances, reproduction of expected storm dynamics, and mesoscale features. The sample is effectively a single event, which the authors acknowledge as a major limitation.
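To make "deviation from a balance" concrete, here is a minimal sketch of one possible diagnostic. It assumes geostrophic balance as the example; the article does not say which balances the preprint checks, and the function name, the synthetic low, and every parameter value below are illustrative assumptions rather than anything taken from the study.

```python
import numpy as np

def geostrophic_residual(phi, u, v, f, dx, dy):
    """RMS magnitude (m/s) of the wind's departure from geostrophic balance.

    phi is geopotential (m^2/s^2) on a regular grid with axis 0 = y, axis 1 = x.
    Hypothetical helper for illustration only.
    """
    dphi_dy, dphi_dx = np.gradient(phi, dy, dx)
    u_g = -dphi_dy / f  # geostrophic zonal wind
    v_g = dphi_dx / f   # geostrophic meridional wind
    return float(np.sqrt(np.mean((u - u_g) ** 2 + (v - v_g) ** 2)))

# Synthetic Gaussian low on a 2000 km x 2000 km grid (assumed values).
dx = dy = 10_000.0           # grid spacing in metres
f = 1.2e-4                   # Coriolis parameter, roughly mid-latitude
amp, scale = -1000.0, 300_000.0
x = np.arange(201) * dx
y = np.arange(201) * dy
X, Y = np.meshgrid(x, y)
xc, yc = x.mean(), y.mean()
phi = amp * np.exp(-((X - xc) ** 2 + (Y - yc) ** 2) / (2 * scale ** 2))

# Winds that are analytically in geostrophic balance with phi.
u_bal = phi * (Y - yc) / (scale ** 2 * f)
v_bal = -phi * (X - xc) / (scale ** 2 * f)

balanced = geostrophic_residual(phi, u_bal, v_bal, f, dx, dy)
# Add an arbitrary 5 m/s zonal offset that the pressure field does not support.
disturbed = geostrophic_residual(phi, u_bal + 5.0, v_bal, f, dx, dy)
print(f"balanced residual:  {balanced:.3f} m/s")
print(f"disturbed residual: {disturbed:.3f} m/s")
```

A small residual means the wind and mass fields agree with each other; a large one signals a dynamically inconsistent state, regardless of how close each field is to observations on its own.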

Despite competitive RMSE scores, Bris struggled to capture important details of the storm and introduced significant disruptions to physical balances. The root cause appears to be fine-scale noise in the model's output fields, which creates unrealistic spatial gradients. This finding aligns with patterns seen in other data-driven models that perform well on average conditions but falter on extremes.
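The mechanism can be illustrated with a toy example. The sketch below is not the preprint's analysis: it assumes a synthetic smooth field and Gaussian grid-scale noise with an arbitrary amplitude, and simply shows how a small field error can coexist with a large gradient error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth "true" field on a periodic-looking grid (illustrative only).
n = 256
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
spacing = x[1] - x[0]
X, Y = np.meshgrid(x, x)
truth = 10.0 * np.sin(X) * np.cos(Y)

# "Forecast" = truth plus small-amplitude grid-scale noise (assumed sigma).
forecast = truth + rng.normal(0.0, 0.2, truth.shape)

def rms(a):
    """Root-mean-square of an array."""
    return float(np.sqrt(np.mean(a ** 2)))

# Field error relative to the field's own RMS amplitude.
rel_field = rms(forecast - truth) / rms(truth)

# Gradient error relative to the true gradient's RMS magnitude.
gy_t, gx_t = np.gradient(truth, spacing)
gy_f, gx_f = np.gradient(forecast, spacing)
grad_err = np.sqrt(np.mean((gx_f - gx_t) ** 2 + (gy_f - gy_t) ** 2))
grad_ref = np.sqrt(np.mean(gx_t ** 2 + gy_t ** 2))
rel_grad = float(grad_err / grad_ref)

print(f"relative field error:    {rel_field:.3f}")
print(f"relative gradient error: {rel_grad:.3f}")
```

Because differentiation amplifies variations at the grid scale, noise that barely registers in RMSE can dominate any quantity built from spatial derivatives, which is the kind of quantity physical balances depend on.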

Much existing coverage of AI weather forecasting misses this gap between error metrics and physical fidelity. Traditional NWP models numerically solve the equations of fluid dynamics, so conservation laws are built into the forecast. In contrast, models like Bris learn statistical patterns from historical data and can miss rare but critical dynamics.

Related research points the same way. DeepMind's GraphCast (Science, 2023) demonstrated skillful medium-range global forecasts, yet it has faced similar critiques over extreme-event performance and the lack of explicit physics constraints. Likewise, NVIDIA's FourCastNet (arXiv:2202.11214) showed computational efficiency advantages, while follow-up studies highlighted the need for improved physical consistency.

This matters as machine learning rapidly transforms weather forecasting and climate prediction. Many types of extreme events are becoming more frequent with climate change, making physical reliability essential for trustworthy warnings. The preprint correctly flags that current AI approaches may require hybrid physics-informed designs or additional training strategies to avoid generating dynamically inconsistent predictions.

The limitations are clear: the study analyzes a single storm and one regional model, and it does not evaluate ensemble versions or longer time series. Results may not generalize across all deep-learning weather prediction (DLWP) systems. Still, the study provides an important reality check beyond headline RMSE numbers.

THE FACTUM