Claude Opus 4.8 Benchmarks Explained

May 28, 2026·6 min·By Nicolas Zeeb

Best Practices

Anthropic released Claude Opus 4.8 on May 28, 2026. The framing is modest. Anthropic itself calls it "a modest but tangible improvement" over Opus 4.7, and the pricing is unchanged at $5 per million input tokens and $25 per million output tokens. The benchmark numbers tell a more interesting story than the framing suggests.

Here's a walk through each headline category Anthropic published, what the scores actually mean, and the result buried below the table that probably matters most.

1. Coding: SWE-Bench Pro

SWE-Bench Pro is the hardest of the SWE-bench variants. Problems come from actively-maintained repositories with multi-file diffs and no public ground-truth leakage. It's the closest thing the field has to a coding benchmark that resists memorization.

SWE-Bench Pro

Pass rate, % (higher is better)

Opus 4.8

69.2

Opus 4.7

64.3

GPT-5.5

58.6

Gemini 3.1 Pro

54.2

Source: Claude Opus 4.8 System Card, Table 8.1.A.

Opus 4.8 lands at 69.2%, almost 5 points clear of Opus 4.7 (64.3%) and over 10 points ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%). On SWE-bench Verified (the original 500-problem set) Opus 4.8 hits 88.6% vs Opus 4.7's 87.6% and Gemini 3.1 Pro's 80.6%. The harder the variant, the bigger the gap.

Michael Truell at Cursor adds context that doesn't fit on a leaderboard: on CursorBench, Opus 4.8 uses fewer steps for the same intelligence, meaning the token-per-task cost drops without sacrificing pass rates. Scott Wu at Cognition reports the model "fixes the comment-verbosity and tool-calling issues we saw with Opus 4.7." The headline number is the gap on Pro. The shape of the upgrade is the same answer at lower cost.

2. Terminal work: Terminal-Bench 2.1

Terminal-Bench 2.1 is where benchmark methodology starts mattering as much as model capability.

Terminal-Bench 2.1

Mean reward, % (higher is better)

GPT-5.5 (Codex CLI)*

83.4

GPT-5.5 (Terminus-2)

78.2

Opus 4.8 (Terminus-2)

74.6

Gemini 3.1 Pro (Terminus-2)

70.3

Opus 4.7 (Terminus-2)

66.1

*Terminal-Bench is sensitive to harness choice. GPT-5.5's headline 83.4% uses OpenAI's Codex CLI. Apples-to-apples on the public Terminus-2 harness, Opus 4.8 sits closer to GPT-5.5 and well above Opus 4.7 and Gemini 3.1 Pro. Source: Anthropic system card.

This is the standard apples-vs-oranges problem with frontier coding evals. The harness matters as much as the model, and labs usually choose the harness that flatters their own model. Anthropic disclosed both numbers, which is more than the eval field usually gets. The 8-point Opus 4.7 → Opus 4.8 jump on the same harness is the more honest read.

3. Hard reasoning: Humanity's Last Exam

Humanity's Last Exam (HLE) is the hardest general-knowledge reasoning benchmark in regular rotation. Anthropic reports two configurations: with tools and without.

Without tools, Opus 4.8 leads at 49.8% versus Opus 4.7 (46.9%), Gemini 3.1 Pro (44.4%), and GPT-5.5 (41.4%). With tools, the gap widens.

Humanity's Last Exam (with tools)

Accuracy, % (higher is better)

Opus 4.8

57.9

Opus 4.7

54.7

GPT-5.5

52.2

Gemini 3.1 Pro

51.4

Source: Claude Opus 4.8 System Card, Table 8.1.A.

GPQA Diamond is a useful counter-data point. Opus 4.8 lands at 93.6, slightly below Opus 4.7 (94.2) and Gemini 3.1 Pro (94.3). All three are statistically tied. GPQA is a benchmark the field has effectively beaten. HLE is where new headroom still exists, and Opus 4.8 takes the top of that range.

4. Computer use: OSWorld-Verified

OSWorld-Verified evaluates an agent's ability to complete real-world computer tasks across editing documents, browsing the web, and managing files on a live Ubuntu VM.

OSWorld-Verified

Pass@1, % (higher is better)

Opus 4.8

83.4

Opus 4.7

82.8

GPT-5.5

78.7

Gemini 3.1 Pro

76.2

Source: Claude Opus 4.8 System Card, Table 8.1.A. Opus 4.7 score updated to 82.8 after a bug fix on the zoom tool and a raise of max tokens per turn from 16K to 128K.

The cleaner result lives outside the system card table. Miguel Gonzalez at Browserbase reports Opus 4.8 scoring 84% on Online-Mind2Web, which he calls "a meaningful jump over both Opus 4.7 and GPT-5.5." Browser-agent reliability isn't incidental to this release. Anthropic shipped "dynamic workflows" the same day, where Claude Code plans work and runs hundreds of parallel subagents in a single session. The OSWorld and Online-Mind2Web jumps are the prerequisite for that feature being useful, not a vanity score.

A footnote worth surfacing: Anthropic restated the Opus 4.7 OSWorld-Verified score from a prior baseline up to 82.8% after a zoom-tool bug fix. Read generously, it's methodology cleanup. Read skeptically, it makes the 4.7-to-4.8 delta look smaller than it was on the original scoring. Either way, it's flagged transparently.

5. Professional work: GDPval-AA

GDPval-AA is the benchmark with the most dramatic spread on the table. It measures real-world economically valuable knowledge work across professional domains, and Opus 4.8 doesn't just lead, it leads by 576 points over Gemini 3.1 Pro, the largest gap between top and bottom on any benchmark Anthropic published.

GDPval-AA

Aggregate score (higher is better, scale to 2000)

Opus 4.8

1,890

GPT-5.5

1,769

Opus 4.7

1,753

Gemini 3.1 Pro

1,314

Source: Claude Opus 4.8 System Card, Table 8.1.A. GDPval-AA measures economically valuable knowledge work across professional domains.

Two patterns to read off this chart. First, the frontier-class cluster (Opus 4.8, GPT-5.5, Opus 4.7) is tight: 137 points spread across the top three. Second, Gemini 3.1 Pro drops off a cliff. For workflows priced on the quality of professional output, the choice between Opus 4.8 and Opus 4.7 is incremental. The choice between either Anthropic model and Gemini 3.1 Pro is structural.

This is also the benchmark category where the Harvey customer testimonial lands. Harvey reports Opus 4.8 is the first model to break 10% overall on their Legal Agent Benchmark at the all-pass standard. Breaking 10% sounds small until you remember the all-pass standard requires the model to complete every sub-task in a multi-step legal workflow correctly. Niko Grupen at Harvey frames it in customer terms: it changes how much real attorney work clients can hand off.

6. Financial analysis: Finance Agent v2

For finance, the picture gets more complicated.

Finance Agent v2

Pass rate, % (evaluated by Vals AI)

Gemini 3.5 Flash

57.9

Opus 4.8

53.9

GPT-5.5

51.8

Opus 4.7

51.5

Gemini 3.1 Pro

43.0

Source: Claude Opus 4.8 System Card, Table 8.1.A. Gemini 3.5 Flash is a smaller, faster model than Opus 4.8 and GPT-5.5.

Gemini 3.5 Flash leading Finance Agent v2 at 57.9% is the entry that doesn't fit the "Opus on top" narrative. Worth flagging directly: a smaller, cheaper model is the leaderboard winner here. Opus 4.8 still leads the frontier-class field (53.9% vs GPT-5.5's 51.8% and Gemini 3.1 Pro's 43.0%), and Anthropic reports Databricks Genie now reasons over PDFs at 61% cheaper token cost than Opus 4.7. For enterprise workflows priced on token spend, that's the upgrade that shows up in invoices.

Honesty caught up to capability

The finding that probably matters most doesn't sit anywhere on the headline chart. Anthropic reports that Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.

This is the failure mode every Cursor user, every Devin user, every Claude Code user has watched happen in real time. The model claims the task is done, the test suite hasn't been run, the edge case hasn't been considered, and you only catch it because something felt off. Cutting that rate by 4x is a bigger productivity unlock than any single benchmark point.

Anthropic's Alignment team also concluded that Opus 4.8 "reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user's best interest," and that misalignment rates are "similar to our best-aligned model, Claude Mythos Preview." That's the first time a generally-available Claude has been benchmarked at Mythos-class alignment levels. Mythos itself remains gated behind Project Glasswing. Anthropic confirms Mythos-class models for general release "in the coming weeks."

Cheaper, faster, more controllable

Three structural changes landed alongside Opus 4.8 that won't show up on any leaderboard:

Fast mode is 3x cheaper. Opus 4.8 fast mode runs at 2.5× speed for $10/$50 per million input/output tokens, a third of what fast mode cost on prior Opus models.
Effort control is now a user-facing dial. Default is "high." Users can pick "extra" (xhigh in Claude Code) or "max," and the model spends more tokens to chase better answers. Most reported scores are at default effort. Anthropic notes the higher tiers improve quality further.
The Messages API accepts system entries inside the messages array. Developer-only change with the largest agent-builder impact. You can now update Claude's instructions mid-task without breaking the prompt cache or routing the update through a user turn. Permissions, token budgets, environment context: all updatable inline.

Takeaways

Opus 4.8 leads on five of six headline benchmarks. The one it loses (Finance Agent v2) goes to Gemini 3.5 Flash, a smaller and cheaper model. Smaller models keep winning specific verticals.
The biggest spread on the table is GDPval-AA: a 576-point gap between Opus 4.8 (1,890) and Gemini 3.1 Pro (1,314). For knowledge-work-heavy applications, the model choice is structural, not incremental.
Terminal-Bench shows GPT-5.5 ahead on its own harness and Opus 4.8 ahead on the public harness. Trust the like-for-like, not the headline number.
GPQA Diamond is saturated. Opus 4.8 (93.6), Opus 4.7 (94.2), and Gemini 3.1 Pro (94.3) are statistically tied at the top. HLE with tools is where headroom still exists, and Opus 4.8 widens the gap to 57.9%.
The 4x honesty improvement will change day-to-day developer experience more than any single benchmark point. It's also the result hardest to communicate in a chart.
Mythos-class alignment is no longer Mythos-exclusive. Opus 4.8 matches Mythos Preview's alignment numbers, which is either a quiet capability leak or a sign that alignment work generalizes faster than capability work. We'll know which when Mythos-class models actually ship.

You can run Opus 4.8 as the model behind your Vellum assistant across Mac, iOS, web, Slack, and Telegram. The Pro plan ships with custom LLM credentials, so the new fast-mode pricing, the GDPval lead, and the 4x honesty improvement carry over to your assistant's daily work without you having to manage API keys or harness setup.

Hatch your assistant →