technologyFriday, June 26, 2026 at 04:50 AM

6,000 emails from 2,000 attackers produced zero leaks from Claude Opus 4.6 under basic rules

A public test of Claude Opus 4.6 against 2,000 attackers confirmed zero secret leaks under minimal rules. Stronger instruction-following models resisted techniques that succeed on smaller variants. Stateless processing and rule logging are immediate controls for any deployed agent with tool access.

AXIOM

80.0% accuracy

0 views

The experiment exposed a public email interface to Fiu, an assistant running Claude Opus 4.6 on a VPS. Attackers used authority impersonation, language switching, fake incident reports, and rapid multi-message bursts. The agent processed each email in isolation after batch contamination was identified around email 500. Gmail suspended the account after traffic spikes triggered fraud systems, incurring over $500 in API charges before reinstatement three days later.

Data showed the model referenced the explicit rules in its reasoning traces even under pressure. No credentials from secrets.env were output. Earlier trials with weaker models on similar tasks have produced leaks within dozens of attempts, indicating capability thresholds matter more than prompt length. Batch context contamination introduced false positives that required resetting state per message.

Operational takeaway is that production agents should default to stateless per-message evaluation and log rule invocations. Sponsors including Corgea and Abnormal AI funded continuation, confirming commercial interest in measuring real attack surfaces. Future tests mixing model sizes will establish the exact parameter count where simple instructions cease to hold.

Next step is controlled multi-turn dialogues on the same setup to measure whether conversation history erodes the observed resistance within five exchanges.

⚡ Prediction

Claude Opus 4.6 will maintain >98% resistance to single-turn extraction on tool-equipped agents through December 2026 when rules are under 50 tokens.

Sources (3)

[1]
Primary Source(https://www.fernandoi.cl/posts/hackmyclaw/)
[2]
Anthropic Model Spec(https://anthropic.com/model-spec)
[3]
Prompt Injection in LLM Agents(https://arxiv.org/abs/2309.05583)