DeepSeek-V4 Releases 1.6T MoE Models With 1M Token Context at 27% FLOPs
DeepSeek-V4 series achieves 1M token context with 90% KV cache reduction versus V3.2, leading open-source benchmarks via hybrid attention and specialized post-training.
DeepSeek previewed V4-Pro (1.6T total parameters, 49B active) and V4-Flash (284B total, 13B active) MoE models, both supporting native 1M-token contexts.
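A quick back-of-envelope check, using only the parameter counts above, illustrates why MoE models of this size remain serviceable: per-token compute tracks the *active* parameter count, not the total.

```python
# Sketch using the announced parameter counts; the point is that an MoE
# model's per-token compute scales with its active parameters, not its total.

def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per token (inputs in billions)."""
    return active_b / total_b

v4_pro = active_fraction(1600, 49)    # V4-Pro: 1.6T total, 49B active
v4_flash = active_fraction(284, 13)   # V4-Flash: 284B total, 13B active

print(f"V4-Pro activates {v4_pro:.1%} of its parameters per token")
print(f"V4-Flash activates {v4_flash:.1%} of its parameters per token")
```

So V4-Pro routes roughly 3% of its weights per token, which is why its serving cost resembles a dense ~49B model rather than a 1.6T one.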
Hybrid attention combining Compressed Sparse Attention and Heavily Compressed Attention reduces inference to 27% of the single-token FLOPs and 10% of the KV cache of DeepSeek-V3.2 at 1M tokens. The models use manifold-constrained hyper-connections and the Muon optimizer; pre-training on 32T tokens was followed by two-stage post-training of domain experts via SFT, RL with GRPO, then on-policy distillation (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro). V4-Pro-Base scores 90.1 on MMLU, 73.5 on MMLU-Pro, 62.6 on FACTS Parametric, and 51.5 on LongBench-V2 (primary source).
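To see what the 10% KV cache figure means at full context, here is a hedged estimate. Only the "10% of V3.2's cache at 1M tokens" ratio comes from the release; the baseline bytes-per-token value is an assumption for illustration, not a published V3.2 number.

```python
# Hedged sketch: KV cache memory at 1M tokens under a 90% reduction.
# BASELINE_BYTES_PER_TOKEN is an assumed placeholder, not a V3.2 spec.

def kv_cache_gib(tokens: int, bytes_per_token: int, ratio: float = 1.0) -> float:
    """Total KV cache size in GiB for a given context length and compression ratio."""
    return tokens * bytes_per_token * ratio / 2**30

BASELINE_BYTES_PER_TOKEN = 70 * 1024  # assumption: ~70 KiB/token for the baseline

baseline = kv_cache_gib(1_000_000, BASELINE_BYTES_PER_TOKEN)         # V3.2-style cache
v4 = kv_cache_gib(1_000_000, BASELINE_BYTES_PER_TOKEN, ratio=0.10)   # 10% per the release

print(f"baseline: {baseline:.1f} GiB, V4: {v4:.1f} GiB")
```

Under this assumption, a ~67 GiB cache shrinks to under 7 GiB, the difference between multi-GPU serving and a single commodity accelerator.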
Coverage of the release under-emphasized the efficiency delta relative to Gemini 1.5 Pro's 1M-token context implementation, which reported higher serving costs at scale (https://arxiv.org/abs/2403.05530). DeepSeek-V2's MLA mechanism laid the foundation for these gains but did not reach million-token efficiency (https://arxiv.org/abs/2405.04434). V4-Pro-Max leads open-source coding and reasoning benchmarks while narrowing the gap to closed-source models on agentic tasks.
AXIOM: DeepSeek-V4's 10% KV cache footprint for 1M tokens enables practical deployment of long-context retrieval and agentic systems on commodity hardware, accelerating competition with proprietary frontier models.
Sources (3)
- [1] Primary Source (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)
- [2] Gemini 1.5 Technical Report (https://arxiv.org/abs/2403.05530)
- [3] DeepSeek-V2: Mixture-of-Experts Model (https://arxiv.org/abs/2405.04434)