FP16 KV Cache Yields 100% Token Divergence From Cache-Free Paths
A new arXiv study finds that FP16 non-associativity in KV caching produces deterministic token-sequence divergence from cache-free inference in all tested models; the effect is localized to cache state and eliminated under FP32.
New research shows that FP16 KV-cached inference diverges systematically from cache-free paths in transformers, challenging core assumptions in LLM optimization.
The primary source establishes 100% token divergence in LLaMA-2-7B, Mistral-7B-v0.3, and Gemma-2-2B on GSM8K under all sampling strategies, including greedy decoding (arxiv.org/abs/2604.15409). Layer-wise profiling shows that GQA models diverge sharply at layer 1, while Gemma-2 accumulates error uniformly across layers owing to its head dimension and sliding-window attention; activation patching of the residual stream fails to recover the cache-free trajectory, localizing the effect to the stateful KV cache. FP32 controlled runs reduce divergence by eight orders of magnitude and drop the token flip rate to exactly 0.0%.
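The headline numbers above (100% divergence, token flip rate) are sequence-level comparisons between two greedy decodes of the same prompt, one with the KV cache enabled and one without. A minimal sketch of how such metrics can be computed over the resulting token-ID lists (the function names are ours for illustration, not the paper's code):

```python
def first_divergence(cached, cache_free):
    """Index of the first position where the two token sequences differ,
    or -1 if they agree over equal lengths."""
    for i, (a, b) in enumerate(zip(cached, cache_free)):
        if a != b:
            return i
    if len(cached) != len(cache_free):
        return min(len(cached), len(cache_free))
    return -1

def token_flip_rate(cached, cache_free):
    """Fraction of aligned positions whose tokens differ; 0.0 means the
    two decoding paths are token-identical over their common length."""
    n = min(len(cached), len(cache_free))
    return sum(a != b for a, b in zip(cached, cache_free)) / n if n else 0.0

# Toy token sequences: first mismatch at position 2, 2 of 4 positions flip.
print(first_divergence([5, 9, 3, 7], [5, 9, 4, 8]))  # → 2
print(token_flip_rate([5, 9, 3, 7], [5, 9, 4, 8]))   # → 0.5
```

Under FP32, the paper reports these metrics collapse to full agreement (flip rate exactly 0.0%); under FP16 they do not.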
Related implementations in vLLM (arxiv.org/abs/2309.06180) and FlashAttention-2 (arxiv.org/abs/2307.08691) presuppose numerical equivalence between cached and recomputed attention, yet neither accounts for FP16 accumulation-order differences. Prior coverage missed that cache-ON yields higher accuracy in 8 of 9 conditions and that the direction of divergence is deterministic rather than stochastic. These results document that standard FP16 KV-cached autoregressive inference is non-equivalent to cache-free execution across current production LLM deployments.
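The accumulation-order effect is easy to reproduce directly in half precision (an illustrative NumPy sketch, not the paper's code): IEEE FP16 addition is not associative, so reordering the same summands, which is exactly what incremental KV-cache accumulation does relative to a one-pass recomputation, can change the rounded result.

```python
import numpy as np

# float16 has a 10-bit mantissa, so the ulp of 1.0 is 2**-10. A term of
# 2**-11 is exactly half an ulp: added to 1.0 one step at a time it is
# rounded away (round-half-to-even), but pre-summing two such terms first
# yields 2**-10, which survives.
a = np.float16(1.0)
b = np.float16(2.0 ** -11)

left = (a + b) + b    # each add rounds back down to 1.0
right = a + (b + b)   # b + b = 2**-10 is representable, so it is kept

print(float(left), float(right))  # → 1.0 1.0009765625
```

The same mechanism, scaled up to thousands of attention contributions per token, is what lets a cached and a cache-free FP16 forward pass round a near-tie between two candidate tokens in opposite directions.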
AXIOM: Standard FP16 KV caching in production LLM serving frameworks generates different token sequences than cache-free recomputation on identical inputs due to non-associative accumulation; this affects every optimized deployment of LLaMA, Mistral and Gemma-class models.
Sources (3)
- [1] The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference (https://arxiv.org/abs/2604.15409)
- [2] vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (https://arxiv.org/abs/2309.06180)
- [3] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (https://arxiv.org/abs/2307.08691)