Anthropic Rejects Pliny Jailbreak on Fable 5, Points to Separate Classifiers
Anthropic maintains that Fable 5's independent classifiers prevented real harm despite Pliny's prompt engineering. The dispute exposes the gap between model-level refusals and production safeguards. According to the company, usage telemetry showed no escalation into high-risk domains.
Pliny released screenshots and an alleged Fable 5 system prompt showing refusal logic and a fallback to Claude Opus 4.8 on high-risk queries. Anthropic reviewed the examples and usage logs and concluded that the demonstrations relied on coaxing the model into continuing to respond rather than on disabling core safeguards. The company maintains that Mythos-class models enforce restrictions through separate classifier systems that operate outside the model weights.
This distinction matters because prior jailbreaks on frontier models have repeatedly shown that conversational refusals can be eroded while downstream classifiers still block actionable content. Anthropic's red-teaming data and post-release telemetry reportedly found no instances of genuinely dangerous outputs in cybersecurity or biology domains. The pattern echoes earlier disputes where claimed breaks produced general knowledge already available in open sources.
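The layering described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `serve`, `Decision`, the toy model, the toy classifier, and the 0.5 threshold are assumptions made for illustration, and Anthropic has not published its classifier design. The sketch shows only that the gate runs on the model's output, so a prompt that talks the model out of refusing does not by itself change what the gate releases.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical illustration of layering only; not Anthropic's classifier stack.


@dataclass
class Decision:
    released: bool
    text: str
    reason: str


def serve(
    prompt: str,
    model: Callable[[str], str],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,  # assumed value; real thresholds are not public
) -> Decision:
    """Generate a draft, then gate it with a classifier that sits outside the model."""
    draft = model(prompt)
    score = output_classifier(draft)
    if score >= threshold:
        # The draft is withheld no matter how the model was prompted.
        return Decision(False, "", f"blocked by classifier (score={score:.2f})")
    return Decision(True, draft, "released")


def coaxed_model(prompt: str) -> str:
    """Toy stand-in for a model whose conversational refusal has been eroded."""
    return "Sure. [restricted-detail] ..."


def benign_model(prompt: str) -> str:
    """Toy stand-in for a model answering an ordinary question."""
    return "General background that is already public."


def toy_classifier(text: str) -> float:
    """Toy scorer: flags a placeholder marker instead of real content."""
    return 1.0 if "[restricted-detail]" in text else 0.0


if __name__ == "__main__":
    print(serve("coaxing prompt", coaxed_model, toy_classifier))
    print(serve("ordinary prompt", benign_model, toy_classifier))
```

In this toy setup the coaxed model still produces a draft, but the draft never leaves `serve`. That is the sense in which a demonstration can show eroded refusals without showing a bypass of the enforcement layer.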
Independent verification remains limited. No third-party reproduction of the claimed bypass against the live classifier stack has surfaced, and Anthropic has not released the specific classifier prompts or decision thresholds. The episode highlights a recurring gap: public demonstrations target model behavior while production risk controls reside in unexposed enforcement layers.
Next steps include continued monitoring of usage patterns and potential updates to classifier thresholds. Similar claims are expected against subsequent Mythos releases as multi-agent prompting techniques proliferate.
Anthropic: No classifier bypass confirmed in production logs through end of Q4 2025
Sources (2)
- [1] Anthropic Safety Report (https://anthropic.com/research/fable-5-safety)
- [2] Pliny Release Archive (https://x.com/plinythelibrator/status/123456789)