CVE-Bench Tests LLM Agents on 20 Real CVEs Across Three Prompt Conditions
CVE-Bench is the first public agent benchmark built on real security patches, showing a 50% ceiling on solve rate and degradation that depends on the prompt condition.
CVE-Bench evaluates five frontier models on fixing 20 real-world CVEs drawn from the GitHub Advisory Database. The best result is a 50% solve rate, achieved by gpt-5.5 under full advisory prompts; rates are lower under the behavioral-only and locate-only conditions (Gatti, 2026). The benchmark adapts maintainer security tests to run in sandboxed containers and spans 15 CWE categories and 18 Python projects. Cross-family model comparisons reach statistical significance under McNemar tests with continuity correction (p ≤ 0.040).

SWE-Bench established the template for repository-level code tasks, but it targets general issue resolution rather than security-specific tests linked to CVSS scores and fixed commits (Jimenez et al., 2023). CVE-Bench narrows the scope to security patches and exposes repeatable failure modes, such as wrong-search drift and partial fixes, that SWE-Bench results do not capture. The locate-only condition produces the largest performance drop, a dimension that prior benchmarks did not isolate.

Anthropic reported that Mythos outperforms human experts on vulnerability discovery, yet CVE-Bench documents consistent gaps when agents must apply patches without explicit flaw descriptions (Gatti, 2026). Because the GitHub Advisory Database links advisories to commit SHAs, fixes can be scored against ground truth, which lets the benchmark quantify the gap between discovery claims and remediation capability.
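For readers unfamiliar with the significance test named above, the following is a minimal sketch of a McNemar test with continuity correction applied to paired pass/fail outcomes on the same 20 tasks. The function names (`discordant_counts`, `mcnemar_cc`) and the per-task results are hypothetical and chosen only for illustration; they are not the benchmark's data or its scoring code.

```python
import math


def discordant_counts(a_solved, b_solved):
    """Count tasks solved by exactly one of two models (paired outcomes)."""
    only_a = sum(1 for a, b in zip(a_solved, b_solved) if a and not b)
    only_b = sum(1 for a, b in zip(a_solved, b_solved) if b and not a)
    return only_a, only_b


def mcnemar_cc(only_a, only_b):
    """McNemar chi-square statistic with continuity correction, plus p-value.

    statistic = (|b - c| - 1)^2 / (b + c), compared to chi-square with 1 df.
    For 1 df the upper tail is erfc(sqrt(x / 2)), so no SciPy is required.
    """
    n = only_a + only_b
    if n == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = max(abs(only_a - only_b) - 1, 0) ** 2 / n
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical paired results on 20 tasks (NOT taken from CVE-Bench).
model_a = [True] * 10 + [False] * 10
model_b = [True] * 2 + [False] * 8 + [True] * 1 + [False] * 9

only_a, only_b = discordant_counts(model_a, model_b)
stat, p_value = mcnemar_cc(only_a, only_b)
print(f"only A: {only_a}, only B: {only_b}, chi2: {stat:.2f}, p: {p_value:.4f}")
```

Only the discordant tasks (solved by one model but not the other) enter the statistic, which is why the test suits paired comparisons on a shared task set. With as few as 20 tasks, an exact binomial variant is a common alternative to the chi-square approximation.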
CVE-Bench: Top models reach only 50% remediation on real CVEs even with full advisory context, and solve rates fall further when agents get code access without an explicit vulnerability description.
Sources (3)
- [1] Primary Source (https://giovannigatti.github.io/cve-bench/)
- [2] SWE-Bench Paper (https://arxiv.org/abs/2308.03124)
- [3] GitHub Advisory Database (https://github.com/advisories)