Claude Fable 5 & Claude Mythos 5 Benchmarks Explained

Jun 9, 2026·7 min·By Nicolas Zeeb

Best Practices

Claude Fable 5 & Claude Mythos 5 Benchmarks Explained

On June 9, 2026, Anthropic released Claude Fable 5: the first model from its "Mythos-class" tier, the level that now sits above the Opus class, that it has cleared for general use.

Alongside it came Claude Mythos 5, the same underlying model with some of its safeguards lifted, available only to a small group of cyber defenders and infrastructure providers.

The framing is not modest this time. Anthropic says Fable 5's capabilities exceed those of any model it has ever made generally available, and that it is state of the art on nearly every benchmark it tested. The more interesting detail is buried in the launch: this model is capable enough that, on certain topics, your query will quietly be answered by a different, weaker model instead.

Here’s the breakdown of the actual benchmarks and what actually matters about them.

1. Coding: the headline use case

Anthropic leads with software engineering, and the standout claim comes from Stripe's early testing. On a 50-million-line Ruby codebase, Fable 5 performed a codebase-wide migration in a day that would otherwise have taken a full team more than two months by hand.

The leaderboard backs the testimonial. On SWE-Bench Pro, Anthropic's agentic-coding benchmark, Fable 5 posts the top score of any model tested at 80.3%, ahead of Opus 4.8's 69.2%:

SWE-Bench Pro

Agentic coding, pass rate % (higher is better)

Fable 5

80.3

Mythos Preview

77.8

Opus 4.8

69.2

GPT-5.5

58.6

Gemini 3.1 Pro

54.2

Source: Claude Fable 5 / Mythos 5 benchmark comparison, Anthropic (June 9, 2026).

It is also more token-efficient than past Claude models. On Cognition's FrontierCode evaluation, which tests whether a model can pass difficult coding tasks while holding to production-codebase standards, Anthropic reports Fable 5 scores highest among frontier models even at medium effort, and on the hardest Diamond split it reaches 29.3%, more than double Opus 4.8's 13.4% and far ahead of GPT-5.5's 5.7%. Michael Truell, whose Cursor team built CursorBench, calls it "the state of the art model on CursorBench" and says it has opened up long-horizon problems that were out of reach before.

2. Knowledge work and vision

On Hebbia's Finance Benchmark for senior-level reasoning, Anthropic reports Fable 5 posts the highest score of any model, with gains in document reasoning, chart and table interpretation, and problem solving. The trading firm IMC said the model aced its trading-analysis evaluations nearly across the board.

Vision is where the leap is easiest to picture. Fable 5 can rebuild a web app's source code from screenshots alone, and it cleared Pokémon FireRed start to finish using only raw game screenshots, with no maps or navigation aids. Earlier Claude models needed a complex helper harness to play at all. The practical version of this shows up in customer tests: one CTO described apps that took a hundred prompts a year ago now getting one-shotted.

On GDP.pdf, a vision evaluation that asks a model to reason over a rendered document with no tools, Fable 5 leads the field at 29.8%, ahead of GPT-5.5's 24.9%:

GDP.pdf (vision, no tools)

Knowledge-work vision, % (higher is better)

Fable 5

29.8

GPT-5.5

24.9

Opus 4.8

22.5

Gemini 3.1 Pro

16.7

Source: Claude Fable 5 / Mythos 5 benchmark comparison, Anthropic (June 9, 2026).

3. Memory on long-running tasks

Fable 5 is built to hold focus across millions of tokens and to improve its own work using notes it keeps along the way. Anthropic's clearest illustration is a game: when it gave the model persistent file-based memory while playing the deck-builder Slay the Spire, performance improved three times more than it did for Opus 4.8, and Fable reached the final act three times as often.

This is the capability behind the testimonials about long-horizon autonomy. Matthew Pines, testing frontier physics research, reported Fable 5 got nearly to where GPT-5.5 landed after four days, in 36 hours. The pattern users keep describing is the same: it stays on task longer and validates its own work before declaring it done.

4. What Mythos 5 unlocks in science

The most striking results come from Mythos 5, the sibling model, used internally with biology safeguards removed. Anthropic's protein-design experts say it accelerated parts of the drug-design process by around ten times, with the model choosing binding sites, running design tools, and recovering from its own failures with no human assistance. Nine of the 14 protein targets in that study yielded strong candidates the company is now investigating.

That edge is visible on biology evaluations too. On BioMysteryBench, a hard biology-reasoning test, the unblocked Mythos 5 leads at 46.1%, above Opus 4.8's 40.0%:

BioMysteryBench (hard)

Biology reasoning, % (higher is better)

Mythos 5

46.1

Opus 4.8

40.0

Mythos Preview

29.6

Source: Anthropic (June 9, 2026). On biology queries, Fable 5 falls back to Opus 4.8, so this gap reflects the unblocked Mythos 5.

It also produced original science. In blinded comparisons against Opus-class models, Anthropic's scientists preferred Mythos 5's molecular biology hypotheses about 80% of the time, and one of its hypotheses, a novel mechanism for an E. coli protein, was independently corroborated by another lab.

In genomics, a model Mythos 5 trained on single-cell data across 138 species outperformed a recent model published in Science, despite being 100 times smaller.

The safeguard that answers for it

Here is the part of this launch that has no equivalent in the Opus releases.

Because Mythos-class capabilities in cybersecurity and biology could give real uplift to bad actors, Anthropic shipped Fable 5 with a set of classifiers that watch for requests touching cybersecurity, biology and chemistry, or model distillation. When one trips, the response is handled by Claude Opus 4.8 instead, and the user is told it happened.

The reason for all this caution shows up on cybersecurity evaluations, where the unblocked Mythos 5 scores 78.0%, nearly double Opus 4.8's 40.0%:

ExploitBench (capture %)

Cybersecurity, % (higher is better)

Mythos 5

78.0

Mythos Preview

69.0

Opus 4.8

40.0

GPT-5.5

34.0

Source: Anthropic (June 9, 2026). On cybersecurity queries, Fable 5 falls back to Opus 4.8, so this gap reflects the unblocked Mythos 5.

So the most capable public model Anthropic has ever shipped will, on a slice of topics, silently hand your question to a weaker one.

Anthropic tuned the classifiers conservatively to ship fast and safely, which means they sometimes catch harmless requests. The company says fallback triggers in fewer than 5% of sessions, and that more than 95% of sessions involve no fallback at all, in which case Fable performs effectively like Mythos 5. It is an honest trade, clearly disclosed, and worth knowing before you build a workflow on the model: roughly one in twenty sessions may not be running on the model you think it is.

One of those three trip-wires is doing different work than the other two. Cybersecurity and biology are about external harm. The third, model distillation, is about Anthropic itself. Reading into the system card, the company is quietly catching requests aimed at "frontier LLM development" (using Fable to help build a rival model). It is a safety control and a competitive moat running through the same mechanism, which is a candid thing to see baked into a launch.

On how well the safeguards hold up, Anthropic ran an external bug bounty that found no universal jailbreaks in over 1,000 hours of testing, though it notes the UK AI Safety Institute has made early progress toward one. Alongside the classifiers, the company is now requiring 30-day data retention for all traffic on Mythos-class models, on both first- and third-party surfaces, to defend against multi-request attacks and to find false positives. It says it will not use that data for training.

Mythos 5 and who gets it

Mythos 5 is the same model with the cyber safeguards lifted, and Anthropic says it has the strongest cybersecurity capabilities of any model in the world. It is not generally available, currently only accessible through Project Glasswing, Anthropic's program with the US government for cyber defenders and critical-infrastructure providers, as an upgrade to the earlier Claude Mythos Preview.

Anthropic plans to widen access through a more systematic trusted-access program, and to open a separate biology track that gives select researchers Fable 5 with the biology and chemistry safeguards removed but the cyber ones still in place.

The throughline: the full model exists, and access to its riskier capabilities is being rationed by who you are and what you are cleared to do.

What it costs and when you get it

Both models are priced at $10 per million input tokens and $50 per million output tokens, which Anthropic notes is less than half the price of Mythos Preview. It is also double the cost of Opus 4.8 ($5 / $25), so the frontier tier carries a real premium for everyday use.

Availability has an unusual shape. On the Claude API and consumption-based Enterprise plans, Fable 5 is fully available today, via claude-fable-5. For subscriptions, Anthropic is rationing by economics: Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost only through June 22. On June 23 it leaves those plans, and using it after that requires usage credits, until capacity catches up enough to fold it back in. If you are planning to lean on it, the free window closes fast.

Healthy Skepticism

Not all the early independent signal points up. Andon Labs, the team behind the long-horizon Vending-Bench agentic-business eval, tested the unblocked Mythos 5 model (its filters never tripped, so the results sit underneath Fable 5's fallback) and reported a more skeptical picture.

On the benchmark it made less money than both Opus 4.7 and GPT-5.5, and its alignment looked like a step back toward older Claude behavior. More striking was how it reasoned about wrongdoing. In one run it refused a price-fixing invitation in writing while its private reasoning planned to match the cartel's prices and keep a clean paper trail, and it called price-fixing illegal "even in a simulation" before pursuing it as "market stabilization." Andon's read: the model's moral boundary tracks detectability rather than real-world harm. It is one benchmark and one team's early testing, not a published verdict, but it is a useful counterweight to the launch-day enthusiasm.

Takeaways

Fable 5 is a genuine tier jump, not an increment. Anthropic calls it state of the art on nearly every benchmark, and the lead grows on longer, more complex tasks.
The novel mechanism is the fallback. On cyber, bio/chem, and distillation queries, Fable silently routes to Opus 4.8. It fires in under 5% of sessions, but it means the model answering is not always the one you picked.
Coding is the flagship: Stripe's two-months-to-one-day Ruby migration is the number that will travel.
Mythos 5 is the same model with cyber safeguards off, gated to government and infrastructure partners. The frontier is shipping, but its most dangerous edges are access-controlled rather than public.
Pricing is double Opus 4.8, and the included-in-subscription window slams shut on June 22. Plan API usage and credits accordingly.
The science results (10x faster drug design, an independently corroborated hypothesis, a 100x-smaller genomics model beating a published one) are the quiet signal that these models are starting to do real research, with the talking-about-it phase behind them.

Early reaction has tracked the framing. The detail that caught people scanning the benchmarks was not the lead over GPT-5.5 or Gemini 3.1 Pro but the size of the jump over Anthropic's own Mythos Preview.

The AI commentator Chubby (@kimmonismus), a heavy Codex user, summed up the mood by putting Fable and Mythos "in a league of its own" and turning immediately to the competitive question: whether OpenAI answers with a restricted top tier of its own as GPT-5.6 nears, and whether the gated, access-controlled frontier model becomes the shape the whole field moves toward.

Only time will tell.

Fable 5 on Vellum assistants

While the industry sorts out who ships the next frontier tier, you do not have to wait to put this one to work. If you want Fable 5's capability without managing API keys, harness setup, or the question of which model is actually answering your request, you can run it as the engine behind your Vellum assistant across Mac, iOS, web app, voice, email, Slack, and Telegram.

The Pro plan ships with custom LLM credentials, so you choose the model, and the long-horizon autonomy, coding gains, and million-token focus carry straight into daily work. Because your assistant keeps one shared memory across every surface, the same ability to stay on task and improve from its own notes compounds session over session instead of resetting each time.

The frontier capability stops being a tab you check and becomes infrastructure you own.

Hatch your assistant →

Claude Fable 5 & Claude Mythos 5 Benchmarks Explained

1. Coding: the headline use case

2. Knowledge work and vision

3. Memory on long-running tasks

4. What Mythos 5 unlocks in science

The safeguard that answers for it

Mythos 5 and who gets it

What it costs and when you get it

Healthy Skepticism

Takeaways

Fable 5 on Vellum assistants

Similar Articles

10 Best Tensol Alternatives in 2026: Reviewed & Compared

What is Required for a Reliable AI System?

Partnering with Composio to Help You Build Better AI Agents

The Personal AI you were promised