Cognition Releases FrontierCode Benchmark Measuring AI Code Mergeability
FrontierCode sets new bar for AI coding quality via mergeability metrics from real maintainers.
FrontierCode evaluates whether models produce production-grade code, using 150 tasks contributed by 20+ open-source maintainers and split into Extended, Main and Diamond subsets; the top score on Diamond is Claude Opus 4.8 at 13.4% (cognition.ai/blog/frontier-code, 06.08.26). A task counts as passed only if the patch clears the blocker criteria defined by maintainers; scores aggregate rubric items covering style, tests and standards, yielding 81% fewer false positives than SWE-Bench Pro. GPT-5.5 trails at 6.3% on Diamond yet uses up to 4x fewer tokens. Open-source models lag, with Kimi K2.6 at 3.8% on Diamond. Cognition reports that every task was manually reviewed after more than 40 hours of maintainer effort. The results cite METR experiments showing that prior benchmarks such as SWE-Bench Verified reward non-mergeable patches. FrontierCode extends Devin-team workflows by shifting the target from functional correctness to maintainer acceptance, in line with patterns seen in SWE-Bench Pro and METR robustness studies.
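The blocker-then-rubric scoring described above can be illustrated with a minimal sketch. It is not Cognition's implementation: the `RubricItem` fields, the per-item weights, and the weighted-mean aggregation are all assumptions made for illustration. The source states only that passing requires clearing maintainer-defined blockers and that scores aggregate rubric items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One maintainer-defined check (hypothetical structure, not from the source)."""
    name: str
    passed: bool
    blocker: bool = False   # assumption: blockers gate the pass/fail outcome
    weight: float = 1.0     # assumption: items may carry different weights


def task_passes(items):
    """A task passes only if every blocker item is satisfied."""
    return all(item.passed for item in items if item.blocker)


def rubric_score(items):
    """Weighted fraction of rubric items satisfied (assumed aggregation rule)."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    return sum(item.weight for item in items if item.passed) / total


def pass_rate(tasks):
    """Share of tasks whose blockers are all cleared."""
    if not tasks:
        return 0.0
    return sum(1 for items in tasks if task_passes(items)) / len(tasks)


# Hypothetical example data: two tasks with made-up rubric items.
task_a = [
    RubricItem("existing tests still pass", passed=True, blocker=True),
    RubricItem("follows project style", passed=True),
    RubricItem("adds regression test", passed=False),
]
task_b = [
    RubricItem("no public API break", passed=False, blocker=True),
    RubricItem("follows project style", passed=True),
]

print(task_passes(task_a), task_passes(task_b))
print(pass_rate([task_a, task_b]))
```

Under these assumptions a patch can collect rubric credit and still fail: `task_b` satisfies its style item but misses a blocker, so it does not count toward the pass rate.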
AXIOM: FrontierCode shows specialized agent workflows outperform general models on production code tasks by enforcing maintainer-defined standards.
Sources (3)
- [1] Primary Source (https://cognition.ai/blog/frontier-code)
- [2] Related Source (https://arxiv.org/abs/2407.23506)
- [3] Related Source (https://metr.org/blog/2024-05-01-swe-bench/)