technology
Saturday, June 6, 2026 at 03:56 AM

General Instinct Open-Sources InstinctRazor for Sub-4-Bit Frontier MoE Compression

InstinctRazor enables 122B-scale MoE inference on edge hardware via selective quantization and distillation.

General Instinct has open-sourced InstinctRazor, a method that compresses the 245 GB BF16 Qwen3.5-122B-A10B MoE model into a 48 GiB GGUF file that outperforms Gemma-4-26B-A4B on MMLU-Pro and GPQA-D. The method preserves the router, norms, Gated-DeltaNet/SSM layers, and vision pathway, applies aggressive quantization only to the routed experts, and recovers the lost performance through on-policy distillation. When experts are streamed from system RAM, the resulting model supports an 8k context window at 7.6 to 8 GB of peak VRAM. InstinctRazor builds on the GGUF quantization formats documented in the llama.cpp repository, and its distillation step aligns with techniques reported in Alibaba's Qwen technical reports. The robotics deployments cited in the announcement face the same memory-bandwidth and network constraints that edge inference benchmarks such as MLPerf Edge have previously addressed. The release provides a concrete implementation path for running large MoE models without datacenter GPUs.
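The selective scheme and the size figures can be sanity-checked with a short sketch. It is illustrative only and is not taken from the InstinctRazor release: the tensor-name pattern is a hypothetical GGUF-style convention, the helper names are invented for this example, and the parameter count of 122 billion is an assumption read off the model name.

```python
import re

# Hypothetical pattern for routed-expert weights (GGUF-style names assumed
# for illustration; the real release may use different names).
ROUTED_EXPERT = re.compile(r"\.ffn_(gate|up|down)_exps\.weight$")


def plan_precision(tensor_name: str) -> str:
    """Return 'low_bit' for routed-expert tensors and 'keep' for all others."""
    return "low_bit" if ROUTED_EXPERT.search(tensor_name) else "keep"


def average_bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average storage cost per parameter, in bits, for a file of a given size."""
    return file_bytes * 8 / n_params


if __name__ == "__main__":
    example_names = [
        "blk.0.ffn_gate_exps.weight",  # routed expert: quantize aggressively
        "blk.0.ffn_gate_inp.weight",   # router: preserve
        "blk.0.attn_norm.weight",      # norm: preserve
        "blk.0.ssm_out.weight",        # SSM-style layer: preserve
    ]
    for name in example_names:
        print(f"{name:32s} -> {plan_precision(name)}")

    n_params = 122e9            # assumed nominal parameter count
    bf16_bytes = n_params * 2   # 2 bytes per BF16 weight
    gguf_bytes = 48 * 2**30     # 48 GiB, as reported
    print(f"BF16 size: {bf16_bytes / 1e9:.0f} GB")
    print(f"Average bits per weight: {average_bits_per_weight(gguf_bytes, n_params):.2f}")
```

Under these assumptions the BF16 estimate comes to about 244 GB, in line with the reported 245 GB, and the 48 GiB file averages roughly 3.4 bits per weight. Because the preserved tensors stay at higher precision, the routed experts themselves would have to sit below that average.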

⚡ Prediction

AXIOM: Selective expert quantization plus distillation allows frontier-scale MoE models to meet edge memory limits.
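The announcement summary does not state which loss the distillation step uses. As a minimal sketch, assuming the common on-policy setup in which the quantized student samples its own tokens and is trained to match the full-precision teacher with a per-token reverse KL divergence, the objective could look like this (all function names are hypothetical):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(p_s, p_t) if s > 0)


def on_policy_loss(student_seq_logits, teacher_seq_logits):
    """Mean reverse KL over the positions of a student-sampled sequence.

    Both arguments are lists of per-position logit vectors, evaluated on
    the same tokens that the student generated.
    """
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_seq_logits, teacher_seq_logits)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    teacher = [[1.5, 0.7, -0.8], [0.1, 0.2, 0.3]]
    print(on_policy_loss(student, teacher))
```

The loss is zero when the student's distribution matches the teacher's at every position and grows as the two diverge. In a real training loop the same quantity would be computed on tensors and backpropagated into the trainable parts of the student.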

Sources (3)

[1] Primary Source (https://news.ycombinator.com/item?id=48414869)
[2] Related Source (https://general-instinct.com/blog/frontier-moe-sub-4-bit)
[3] Related Source (https://github.com/ggerganov/llama.cpp)