Technology | Tuesday, June 9, 2026 at 03:56 AM

Cognition Releases FrontierCode Benchmark Measuring AI Code Mergeability

FrontierCode sets new bar for AI coding quality via mergeability metrics from real maintainers.

FrontierCode evaluates whether models can produce production-grade code, using 150 tasks from more than 20 open-source maintainers, split into three subsets: Extended, Main and Diamond. The top score on Diamond is Claude Opus 4.8 at 13.4% (cognition.ai/blog/frontier-code, 06.08.26). Passing a task requires clearing the blocker criteria defined by maintainers; scores then aggregate rubric items covering style, tests and standards, which yields 81% fewer false positives than SWE-Bench Pro. GPT-5.5 trails at 6.3% on Diamond, yet uses up to 4x fewer tokens. Open-source models lag further, with Kimi K2.6 at 3.8% on Diamond. Cognition reports that every task was manually reviewed, after more than 40 hours of maintainer effort. The results cite METR experiments showing that earlier benchmarks such as SWE-Bench Verified reward patches that are not mergeable. FrontierCode extends Devin-team workflows by shifting the target from functional correctness to maintainer acceptance, in line with patterns seen in SWE-Bench Pro and METR robustness studies.
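The two-stage scoring described above (a blocker gate, then a rubric aggregate) can be illustrated with a minimal Python sketch. This is not Cognition's implementation: the `TaskResult` structure, the criterion names, the rule that every blocker must be cleared, and the unweighted rubric average are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """One model attempt at one task (hypothetical structure)."""
    task_id: str
    # Maintainer-defined blocker criteria, mapped to whether each was cleared.
    blockers: dict[str, bool]
    # Rubric items on style, tests and standards, mapped to whether each was met.
    rubric: dict[str, bool] = field(default_factory=dict)


def passes(result: TaskResult) -> bool:
    # Assumption: a task passes only if every blocker is cleared.
    # A task with no blockers passes trivially under this rule.
    return all(result.blockers.values())


def rubric_score(result: TaskResult) -> float:
    # Assumption: unweighted fraction of rubric items satisfied.
    if not result.rubric:
        return 0.0
    return sum(result.rubric.values()) / len(result.rubric)


def pass_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks whose attempt cleared the blocker gate.
    if not results:
        return 0.0
    return sum(passes(r) for r in results) / len(results)


# Made-up example data, not taken from the benchmark.
results = [
    TaskResult(
        task_id="task-a",
        blockers={"builds": True, "no_api_break": True},
        rubric={"style": True, "tests_added": True, "docs": False, "naming": True},
    ),
    TaskResult(
        task_id="task-b",
        blockers={"builds": True, "no_api_break": False},
        rubric={"style": True, "tests_added": True, "docs": True, "naming": True},
    ),
]

print(pass_rate(results))        # 0.5
print(rubric_score(results[0]))  # 0.75
```

In this sketch, task-b satisfies every rubric item but still fails because one blocker is not cleared. That mirrors the reported shift from functional correctness to maintainer acceptance, where a polished patch is not enough if a maintainer would still reject it.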

⚡ Prediction

AXIOM: FrontierCode shows specialized agent workflows outperform general models on production code tasks by enforcing maintainer-defined standards.

Sources (3)

[1] Primary Source (https://cognition.ai/blog/frontier-code)
[2] Related Source (https://arxiv.org/abs/2407.23506)
[3] Related Source (https://metr.org/blog/2024-05-01-swe-bench/)