Technology | Tuesday, June 23, 2026 at 04:49 AM

VibeThinker-3B scores 94.3 on AIME26 via curriculum SFT and GRPO at 3B parameters

VibeThinker-3B delivers frontier reasoning scores at 3B scale through optimized SFT and GRPO. Results challenge parameter scaling assumptions and support compression of verifiable tasks into compact cores. The work extends earlier 1.5B findings and maintains instruction following.

AXIOM

80.0% accuracy


The arXiv report details a post-training pipeline applied to a 3B base: curriculum supervised fine-tuning, multi-domain reinforcement learning via GRPO, and offline self-distillation. According to the report, these steps yield scores on verifiable reasoning benchmarks (AIME26, LiveCodeBench, and recent LeetCode contests) that match or exceed those of models with 10 to 100 times as many parameters, while preserving a score of 93.4 on IFEval.
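The core of GRPO, as formulated in the DeepSeek line of work cited in [2], is that each sampled answer is scored relative to the other answers drawn for the same prompt, which removes the need for a separate value model. The sketch below shows only that group-relative normalization step. It is not the VibeThinker training code: the function name, the binary rewards, the `eps` constant, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions sampled for a single prompt.

    Hypothetical helper: each advantage is (reward - group mean) / (group std + eps).
    Population std is an assumption; some implementations use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed verifier output for 8 sampled answers: 1.0 if the answer checks out, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Verified answers get positive advantages, failed answers get negative ones.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every answer in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal. This is one reason a verifiable-task pipeline benefits from prompts the model solves only some of the time.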

The report lists a score of 97.1 with claim-level test-time scaling and a 96.1 percent acceptance rate on unseen contests. Its Parametric Compression-Coverage Hypothesis frames verifiable reasoning as compressible into small cores, in contrast to broad knowledge coverage, which demands larger parameter counts. This aligns with the prior 1.5B results and contrasts with the scaling curves observed in the DeepSeek V3 and Gemini 3 series.
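The article does not define "claim-level" test-time scaling, so the sketch below does not attempt to reproduce it. It shows a different and much simpler technique, majority voting over repeated samples, only to illustrate the general idea of spending extra inference compute to raise accuracy. The function name and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among several sampled completions.

    Hypothetical helper: ties are broken in favor of the answer seen first.
    """
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer


# Assumed final answers extracted from five independent samples for one problem.
sampled = ["42", "17", "42", "42", "9"]
print(majority_vote(sampled))
```

Drawing more samples raises the chance that the correct answer dominates the vote, at the cost of proportionally more inference compute.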

Operationally, the pipeline demonstrates that targeted reinforcement on verifiable tasks can compress frontier reasoning performance into deployment-efficient models. Subsequent work will likely test whether GRPO variants transfer to other base models in the 1B to 7B range without loss of instruction adherence.

The pattern indicates small-model regimes can serve as primary research vehicles for reasoning algorithms rather than mere distillation targets.

⚡ Prediction

For open-weights maintainers: a 3B GRPO variant will exceed 95 on AIME26 within six months of public release.

Sources (3)

[1] Primary Source (https://arxiv.org/abs/2606.16140)
[2] Supporting Source (https://arxiv.org/abs/2501.12948)
[3] Supporting Source (https://arxiv.org/abs/2407.21787)