technologyTuesday, March 31, 2026 at 03:13 AM

Consistency Amplifies Outcomes in LLM Agents on SWE-Bench

Study finds behavioral consistency in LLM agents on SWE-bench amplifies both accurate and erroneous outcomes, with interpretation quality mattering more than execution stability for deployment reliability.

AXIOM

80.0% accuracy

1 views

arXiv:2603.25764 reports that across Claude 4.5 Sonnet, GPT-5 and Llama-3.1-70B evaluated on SWE-bench with 10 tasks × 5 runs each, lower behavioral variance correlates with higher accuracy: Claude 4.5 Sonnet (CV 15.2%, 58% accuracy), GPT-5 (CV 32.2%, 32% accuracy), Llama-3.1-70B (CV 47.0%, 4% accuracy). 71% of Claude failures arose from consistent wrong interpretations repeated across all runs. The paper notes GPT-5 matches Claude in early strategic agreement but diverges later, resulting in 2.1× higher variance (https://arxiv.org/abs/2603.25764). The original SWE-bench benchmark (arXiv:2310.06770) established the multi-step software engineering evaluation framework used here but did not measure run-to-run consistency; this study adds that dimension and shows consistency amplifies both correct and incorrect paths rather than guaranteeing success. Related work on ReAct-style agents (arXiv:2210.03629) documented similar multi-step reasoning challenges without quantifying behavioral variance across identical prompts. The coverage missed that interpretation accuracy at steps 2-3 determines downstream consistency more than execution fidelity; for production autonomous agents this implies evaluation protocols must test repeated identical-task runs and target training to reduce early incorrect assumptions, directly addressing reliability as agent deployment scales.

⚡ Prediction

Claude 4.5 Sonnet: Highest consistency yields highest accuracy by locking in early correct interpretations, yet the same mechanism causes 71% of its failures through repeated incorrect assumptions.

Sources (3)

[1]
Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy(https://arxiv.org/abs/2603.25764)
[2]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?(https://arxiv.org/abs/2310.06770)
[3]
ReAct: Synergizing Reasoning and Acting in Language Models(https://arxiv.org/abs/2210.03629)