Gemma 4 KV Cache Sensitivity Exposed in KL Benchmarks, Qwen 3.6 Shows Robustness for Local Deployment
KL divergence tests reveal Gemma 4's unexpectedly high sensitivity to q8_0 and q4_0 KV cache quantization compared with Qwen 3.6's robustness, synthesizing primary benchmarks with KIVI research and llama.cpp developments to show independent, compounding error sources critical for local LLM memory optimization.
Empirical KL divergence benchmarks on KV-cache quantization for Gemma 4 and Qwen 3.6 deliver actionable data for efficient local LLM deployment, addressing a critical but under-reported technical need in open-source AI.
The LocalBench Substack report establishes that q8_0 KV cache, widely viewed as practically lossless based on earlier llama.cpp testing, produces a KL divergence of 0.108 on Gemma 4 31B and 0.377 on the more sensitive 26B A4B variant, measured with BF16 GGUF files from Unsloth and TurboQuant-inspired attention rotation applied automatically. Qwen 3.6 35B A3B models remain below 0.04 at q8_0 and between 0.087 and 0.117 at q4_0 across 250k tokens spanning coding, science, tool calling, and long documents (https://localbench.substack.com/p/kv-cache-quantization-benchmark). Cache and weight quantization losses are independent and compound, a nuance missed in most coverage, which has centered on perplexity rather than token-by-token shifts in log-probability distributions.
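The metric above is the token-by-token KL divergence between the output distributions of a full-precision-cache run and a quantized-cache run of the same prompt. The sketch below shows one way such a mean KL might be computed from captured logits; the function name and the two-run setup are illustrative assumptions, not the benchmark's actual harness.

```python
import numpy as np

def mean_kl_divergence(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token KL(P_ref || P_quant) over a sequence.

    ref_logits, quant_logits: (num_tokens, vocab_size) arrays of raw logits
    from the same prompt, run with e.g. an f16 vs. a q8_0/q4_0 KV cache.
    """
    def log_softmax(x: np.ndarray) -> np.ndarray:
        # Numerically stable log-softmax over the vocabulary axis.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits)    # reference (full-precision cache)
    log_q = log_softmax(quant_logits)  # quantized-cache distribution
    # KL(P || Q) = sum_v P(v) * (log P(v) - log Q(v)), averaged over tokens.
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())
```

Unlike perplexity, which only scores the probability of the observed token, this captures shifts anywhere in the distribution, which is why the report's per-category numbers surface effects that perplexity-based coverage misses.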
Synthesizing these results with the KIVI framework's asymmetric 2-bit KV cache findings, which reported minimal degradation on Llama-2/3 architectures (https://arxiv.org/abs/2402.14992), and with community benchmarks on the llama.cpp KV cache PRs that first integrated rotation corrections, the Gemma 4 MoE amplification effect stands out against Qwen's MoE stability, revealing architecture-specific error propagation patterns the original piece missed. Gemma degrades uniformly across categories, while Qwen confines nearly all loss to long documents (KL 0.581 at q4_0) and tool calling. Earlier weight-only quantization studies on Gemma by Unsloth did not isolate the cache dimension, understating the cumulative quality impact for extended-context RAG and agentic use.
These findings connect to accelerating demand for long-context local inference on consumer GPUs, where KV cache memory scales linearly with sequence length. Developers should default to Qwen 3.6 for q4_0 cache deployments to keep top-1 accuracy above 60% while achieving roughly 4x memory reduction; Gemma 4 requires an f16 cache or further mitigation research. This is an under-reported selection criterion that will shape efficient open-source deployment strategies going forward.
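The memory stakes can be made concrete with back-of-the-envelope arithmetic. The sketch below estimates KV cache size from model shape; the layer/head/dim values in the test are illustrative placeholders, not the actual Gemma 4 or Qwen 3.6 configurations. The effective bits per element for GGUF block quants include the per-block scale (q8_0 stores 34 bytes per 32 elements, about 8.5 bits; q4_0 stores 18 bytes per 32 elements, about 4.5 bits), so the f16-to-q4_0 saving is closer to 3.6x than an exact 4x.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_elem: float) -> float:
    """Approximate KV cache size in bytes.

    Counts the K and V tensors for every layer: 2 tensors, each of
    shape (seq_len, n_kv_heads * head_dim), at bits_per_elem precision
    (16 for f16, ~8.5 for q8_0, ~4.5 for q4_0 in GGUF block formats).
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return elems * bits_per_elem / 8
```

For a hypothetical 32-layer model with 8 KV heads of dimension 128 at a 4096-token context, an f16 cache needs 512 MiB, and the footprint grows linearly from there with context length, which is why cache quantization dominates long-context memory budgets.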
AXIOM: Gemma 4's high KL divergence under q8_0 KV cache means it needs full precision cache for quality work while Qwen 3.6 tolerates aggressive quantization, giving developers clear guidance on model choice for memory-efficient local deployment.
Sources (3)
- [1] LocalBench KV Cache Quantization Benchmark (https://localbench.substack.com/p/kv-cache-quantization-benchmark)
- [2] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (https://arxiv.org/abs/2402.14992)
- [3] llama.cpp KV Cache Quantization Discussions (https://github.com/ggerganov/llama.cpp/discussions/7638)