Haiku to Opus in Just 10 Bits: LLMs Enable Extreme Compression Breakthrough
Researchers demonstrate LLMs can transfer substantial capabilities from large to small models via 10 yes/no questions, achieving over 100x compression versus prior methods and closing up to 72% of benchmark performance gaps.
According to the primary source, domain-adapted LoRA adapters improve LLM-based arithmetic coding by 2x for lossless compression of generated text, while lossy compression, prompting for succinct rewrites followed by arithmetic coding, reaches ratios of 0.03 (Rinberg, 2026). The introduced Question-Asking compression protocol uses an interactive Twenty Questions approach in which a small model asks yes/no questions of a stronger model, transferring one bit per response. On eight benchmarks covering math, science, and code, 10 binary questions recover 23-72% of the capability gap on standard tasks and 7-38% on harder ones, yielding compression ratios of 0.0006 to 0.004, over 100x smaller than Delétang et al. (2023).
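The interactive loop can be sketched as follows. This is a minimal illustration assuming two stub model interfaces, `small_model_ask` and `large_model_answer`; the paper's actual prompting and question-selection strategy are not specified in the source, so both stubs are hypothetical placeholders:

```python
# Hypothetical sketch of the Question-Asking protocol: the small model poses
# yes/no questions, the strong model answers, and only one bit crosses the
# channel per round. Both model functions below are illustrative stubs.

def small_model_ask(task: str, history: list) -> str:
    """Stub: the small model picks its next yes/no question."""
    return f"Q{len(history) + 1}: does property {len(history) + 1} hold for '{task}'?"

def large_model_answer(question: str) -> bool:
    """Stub: the strong model replies with a single yes/no bit."""
    return len(question) % 2 == 0  # placeholder deterministic answer

def question_asking_transfer(task: str, budget_bits: int = 10) -> list:
    """Run the interactive loop under a fixed bit budget."""
    history = []
    for _ in range(budget_bits):
        q = small_model_ask(task, history)
        a = large_model_answer(q)
        history.append((q, a))  # the small model conditions on all prior Q/A pairs
    return [a for _, a in history]

bits = question_asking_transfer("prove n^2 >= n for all n >= 1")
print(len(bits))  # the entire transfer is exactly 10 bits
```

The key design point is that each question is chosen conditioned on all previous answers, so the 10 bits are adaptive rather than a fixed 10-bit code.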
This extends Delétang et al. (2023) "Language Modeling Is Compression" (https://arxiv.org/abs/2309.11512), which established LLMs as strong arithmetic coders but operated in non-interactive, one-way regimes. The new protocol challenges the assumption that full token sequences must be transmitted for capability transfer, demonstrating dialogic efficiency gains missed in prior one-shot compression literature. The paper's abstract under-reports the bandwidth implications, focusing on benchmark recovery percentages rather than deployment scaling.
Read alongside BitNet: Scaling 1-bit Transformers for Large Language Models (https://arxiv.org/abs/2310.11453), the two results point to simultaneous compression of model weights and model outputs. Ten-bit interactive transfers complement sub-2-bit parameter representations, letting small on-device models query distant large models at minimal cost. This combination, unaddressed in the source, suggests deployment expenses could drop by orders of magnitude, expanding edge AI viability in bandwidth-scarce environments.
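To make the bandwidth claim concrete, a back-of-envelope check (our own arithmetic, not from the paper) shows what size of fully-transmitted reference output the reported ratios of 0.0006 to 0.004 correspond to, given a fixed 10-bit interactive budget:

```python
# Compression ratio = bits actually transferred / bits of the full payload,
# so the payload size implied by a reported ratio is budget / ratio.

BUDGET_BITS = 10  # the protocol's fixed interactive budget

def implied_payload_bits(ratio: float) -> float:
    """Payload size (bits) whose 10-bit transfer yields the given ratio."""
    return BUDGET_BITS / ratio

for ratio in (0.0006, 0.004):
    print(f"ratio {ratio}: full transfer ~ {implied_payload_bits(ratio):,.0f} bits")
# ratio 0.0006: full transfer ~ 16,667 bits
# ratio 0.004: full transfer ~ 2,500 bits
```

In other words, the reported ratios are consistent with replacing reference transmissions of roughly 2,500 to 17,000 bits with a 10-bit dialogue, which is where the 100x-plus savings over one-shot compression comes from.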
AXIOM: Small models can tap large-model intelligence using just 10 bits of targeted questions instead of full outputs, slashing bandwidth and compute costs to make advanced AI practical on edge devices worldwide.
Sources (3)
- [1] Primary Source (https://arxiv.org/abs/2604.02343)
- [2] Language Modeling Is Compression (https://arxiv.org/abs/2309.11512)
- [3] BitNet: Scaling 1-bit Transformers for Large Language Models (https://arxiv.org/abs/2310.11453)