Sequential KV Cache Compression Exceeds Per-Vector Shannon Limit
arxiv:2604.15356 proves that sequential KV compression via probabilistic language tries (PLTs) beats per-vector entropy bounds, delivering 914x gains over TurboQuant even at pessimistic overhead estimates, with compression ratios that improve for longer contexts.
Lede: The April 2026 arXiv paper introduces sequential KV cache compression via probabilistic language tries, treating KV vectors as samples of a language rather than as independent data and achieving per-token entropy of 3.3-4.3 bits.
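The 3.3-4.3 bits/token figure follows directly from the standard identity that cross-entropy in bits per token equals log2 of perplexity; a quick check for the paper's stated perplexity range of 10-20:

```python
import math

# Cross-entropy (bits/token) = log2(perplexity).
# The paper's perplexity range of 10-20 maps onto its entropy figure:
for ppl in (10, 20):
    bits = math.log2(ppl)
    print(f"perplexity {ppl}: {bits:.2f} bits/token")
# perplexity 10 -> 3.32 bits/token; perplexity 20 -> 4.32 bits/token
```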
The primary source (https://arxiv.org/abs/2604.15356) describes a two-layer system: probabilistic prefix deduplication using the trie metric d_T(s, s') = -log_2 P_M(s ^ s'), and predictive delta coding that stores residuals from the model's own next-token predictions. It explicitly states that this surpasses TurboQuant's per-vector Shannon limit because tokens follow the formal language on which the model was trained, giving the bound H(KV_{i+1} | KV_{<=i}) <= H(token_{i+1} | token_{<=i}). Prior per-vector methods cited in the abstract are limited to roughly 3 bits per component across 64-128 attention-head components, while the new sequential bound yields a theoretical 914000x ratio at perplexity 10-20.
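A minimal sketch of the trie metric, assuming (as is standard for prefix tries) that s ^ s' denotes the meet of two sequences in the trie, i.e. their longest common prefix; the probability model here is a hypothetical uniform stand-in, not the paper's P_M:

```python
import math

def longest_common_prefix(s, t):
    """Longest shared token prefix of two sequences."""
    i = 0
    while i < min(len(s), len(t)) and s[i] == t[i]:
        i += 1
    return s[:i]

def d_T(s, t, log2_prob):
    """Trie metric d_T(s, s') = -log2 P_M(s ^ s'), reading s ^ s'
    as the meet in the prefix trie (longest common prefix).
    log2_prob stands in for the model's log-probability of a prefix."""
    return -log2_prob(longest_common_prefix(s, t))

# Toy stand-in model: each token uniform over 256 symbols, so
# log2 P_M(prefix) = -8 * len(prefix).  (Illustration only.)
toy_log2_prob = lambda seq: -8.0 * len(seq)

a = [5, 17, 42, 9]
b = [5, 17, 99]
print(d_T(a, b, toy_log2_prob))  # shared prefix of length 2 -> 16.0
```

Shared prefixes that are likely under the model are cheap (small d_T), which is what makes probabilistic prefix deduplication effective on language-like KV streams.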
Related works addressed adjacent problems: SmoothQuant (https://arxiv.org/abs/2211.10438) tackled weight and activation quantization, and KIVI (https://arxiv.org/abs/2402.02750) introduced asymmetric 2-bit KV caching, but neither exploited sequential dependencies or the model's near-optimal language-prediction property. The original abstract coverage omitted that compression improves rather than degrades as context length grows, and that the two layers compose orthogonally with existing quantizers. Even at 1000x above the entropy floor, the method still delivers 914x gains over TurboQuant, according to the proofs.
Patterns from the speculative decoding literature and from arithmetic coding align with the PLT approach: both treat the LM as an optimal predictor of its own output distribution. The paper notes this enables KV cache memory to scale sublinearly with context length, a detail missed by coverage focused solely on static quantization limits.
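The arithmetic-coding connection can be made concrete: an arithmetic coder driven by a predictor spends about -log2 p(token) bits per token, so when the predictor is the LM itself, total code length approaches the sequence's cross-entropy under that LM. A sketch with hypothetical per-token probabilities:

```python
import math

# Hypothetical probabilities the predictor assigns to each observed token.
predicted_probs = [0.5, 0.9, 0.25, 0.8, 0.6]

# Arithmetic coding's ideal code length: -log2 p per token, summed.
ideal_bits = sum(-math.log2(p) for p in predicted_probs)
print(f"ideal code length: {ideal_bits:.2f} bits "
      f"({ideal_bits / len(predicted_probs):.2f} bits/token)")
```

Tokens the model predicts confidently (p = 0.9) cost a fraction of a bit, which is why a near-optimal predictor pushes average cost toward the language's entropy rate rather than any fixed per-vector budget.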
AXIOM: Probabilistic language tries let KV caches compress to language entropy rates of 3-4 bits per token instead of per-vector limits, potentially cutting inference memory by orders of magnitude as context lengths increase.
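The orders-of-magnitude claim can be illustrated with back-of-envelope arithmetic. The dimensions below are assumed 7B-class values (32 layers, 32 heads, head_dim 128), not the paper's accounting, so the resulting ratio is illustrative rather than a reproduction of the paper's 914000x figure:

```python
import math

# Assumed 7B-class model dims (hypothetical, for illustration only).
layers, heads, head_dim = 32, 32, 128
components_per_token = 2 * layers * heads * head_dim  # K and V vectors

# Per-vector floor: ~3 bits per component (as cited for prior methods).
per_vector_bits = components_per_token * 3
# Sequential floor: log2(perplexity), taking a mid-range perplexity of 15.
sequential_bits = math.log2(15)

print(f"per-vector:  {per_vector_bits} bits/token")
print(f"sequential:  {sequential_bits:.2f} bits/token")
print(f"ratio:       {per_vector_bits / sequential_bits:,.0f}x")
```

Under these assumptions the gap is roughly 200000x; the exact figure depends on model dimensions and which layers of the scheme are counted, but the five-to-six-orders-of-magnitude scale of the headline ratio is visible even in this crude estimate.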
Sources (3)
- [1] Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit (https://arxiv.org/abs/2604.15356)
- [2] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (https://arxiv.org/abs/2211.10438)
- [3] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache (https://arxiv.org/abs/2402.02750)