---
title: "Claude Sonnet 5 Benchmarks Explained"
description: "A breakdown of Anthropic's Claude Sonnet 5 benchmarks. Head-to-head scores vs Sonnet 4.6 and Opus 4.8 across coding, terminal, reasoning, computer use, knowledge work, and agentic search, plus the GDPval-AA v2 result where Sonnet 5 actually beats Opus 4.8."
canonical_url: "https://www.vellum.ai/blog/claude-sonnet-5-benchmarks-explained"
md_url: "https://www.vellum.ai/md/blog/claude-sonnet-5-benchmarks-explained"
type: "blog"
published_at: "2026-06-30T00:00:00.000Z"
read_time: "8 min"
category: "LLM basics"
featured_image: "/images/og-default.jpg"
authors:
  - "Nicolas Zeeb"
---

# Claude Sonnet 5 Benchmarks Explained

Anthropic's Claude Sonnet 5 closes the gap to Opus 4.8 on coding, terminal work, reasoning, computer use, and knowledge work. Two benchmarks have Sonnet 5 ahead of Opus 4.8. Here's the full read.

Anthropic released [Claude Sonnet 5](https://www.anthropic.com/news/claude-sonnet-5) on June 30, 2026 as "the most agentic Sonnet yet." The headline framing is the closing of a gap: Sonnet 5 sits close to Opus 4.8 in capability, at a fraction of the price.

The benchmark numbers back that framing, then complicate it. On one of Anthropic's headline knowledge-work benchmarks, Sonnet 5 actually edges past the model it's supposed to be closing toward.

Here's a walk through each benchmark Anthropic published, what the scores say about Sonnet 5 versus its predecessor Sonnet 4.6 and the higher-tier Opus 4.8, and the result that probably matters most for how teams pick models going forward.

## 1. Coding: SWE-Bench Pro

SWE-Bench Pro is the harder of the SWE-bench variants. Problems come from actively-maintained repositories with multi-file diffs and no public ground-truth leakage.

```html-render

SWE-Bench Pro
Pass rate, % (higher is better)

Opus 4.8: 69.2
Sonnet 5: 63.2
Sonnet 4.6: 58.1

Source: Claude Sonnet 5 System Card, Anthropic, June 30, 2026.

```

Sonnet 5 lands 5 points ahead of Sonnet 4.6 and 6 points behind Opus 4.8. The 5-point jump over its own predecessor is the more meaningful number for code-heavy applications.

Anthropic's launch partner quotes land on this exactly: Yusuke Kaji, GM of AI for Business, reports Sonnet 5 [carries "dozens of our most challenging real pull requests" through to a tested, verified result on its own](https://www.anthropic.com/news/claude-sonnet-5). Dominic Elm, Founding Engineer, frames it as [tracing failures to actual root causes on brownfield code](https://www.anthropic.com/news/claude-sonnet-5) rather than patching symptoms.

Cursor's CursorBench tells the same story at the IDE level: Sonnet 5 [scores 57% versus Sonnet 4.6's 49%](https://x.com/cursor_ai/status/2072020786181988418), the largest jump Cursor has reported between adjacent Sonnet releases.

## 2. Terminal work: Terminal-Bench 2.1

Terminal-Bench 2.1 is the eval where Sonnet 5 stops being incremental and starts being structural.

```html-render

Terminal-Bench 2.1
Mean reward, % (higher is better, Terminus-2 harness)

Sonnet 5: 80.4
Opus 4.8: 74.6
Sonnet 4.6: 67.0

Source: Claude Sonnet 5 System Card, Anthropic, June 30, 2026. Sonnet 4.6 score restated to 67.0% on the public Terminus-2 harness.

```

Sonnet 5 scores 80.4% on the same harness where Opus 4.8 sits at 74.6% and Sonnet 4.6 sits at 67.0%. Sonnet 5 doesn't close the gap to Opus 4.8 on terminal work. It moves past it.

This is the first benchmark in the suite where the mid-tier model beats the flagship on the same harness. Anthropic doesn't headline this in the launch blog, but the data is in the system card. A 13.4-point jump over Sonnet 4.6 on the same harness is the largest Sonnet-to-Sonnet delta on the table.

Terminal work is where Sonnet-class models were originally supposed to land. Sonnet 3.5 and 3.6 were the first Sonnets Anthropic pitched as good at tool use. Sonnet 5 is the moment that positioning holds against the larger model.

## 3. Hard reasoning: Humanity's Last Exam

Humanity's Last Exam (HLE) is the hardest general-knowledge reasoning benchmark in regular rotation. Anthropic reports two configurations: with tools and without.

```html-render

Humanity's Last Exam (with tools)
Accuracy, % (higher is better)

Opus 4.8: 57.9
Sonnet 5: 57.4
Sonnet 4.6: 46.8

Source: Claude Sonnet 5 System Card, Anthropic, June 30, 2026. Sonnet 4.6 HLE score restated to 46.8% after a grader model update.

```

Sonnet 5 lands at 57.4% with tools, basically tied with Opus 4.8 at 57.9%. The gap to Sonnet 4.6 is 10.6 points, the largest Sonnet-to-Sonnet jump Anthropic has published on HLE.

Worth flagging the methodology: Anthropic restated the Sonnet 4.6 HLE score to 46.8% (from the older 34.6%-no-tools / higher-with-tools baseline) after updating the grader model. Read generously, this is consistency work. Read skeptically, it makes the Sonnet 4.6 → Sonnet 5 delta look smaller than it was on the original scoring. Either way, Anthropic disclosed the change in the launch post footnotes.

Without tools, Sonnet 5's lead over Sonnet 4.6 widens further on the kinds of multi-step questions that benefit from explicit reasoning chains. HLE is also where Sonnet 5's agentic posture shows up: most reported scores are with tool use enabled, and Sonnet 5's gap to Sonnet 4.6 is widest on the tool-enabled runs.

## 4. Computer use: OSWorld-Verified

OSWorld-Verified evaluates an agent's ability to complete real-world computer tasks across editing documents, browsing the web, and managing files on a live Ubuntu VM.

```html-render

OSWorld-Verified
Pass@1, % (higher is better)

Opus 4.8: 83.4
Sonnet 5: 81.2
Sonnet 4.6: 78.5

Source: Claude Sonnet 5 System Card, Anthropic, June 30, 2026. Sonnet 4.6 score restated to 78.5% after methodology changes to how OSWorld-Verified is run.

```

Sonnet 5 at 81.2% is 2.2 points behind Opus 4.8 and 2.7 points ahead of Sonnet 4.6. On the cost-performance curve Anthropic published, Sonnet 5 and Opus 4.8 cover a single range with Sonnet 5 cheaper per task and Opus 4.8 more accurate. The 2.2-point delta is what the effort dial is for.

The methodology footnote is worth surfacing. Anthropic changed how OSWorld-Verified is run between the Sonnet 4.6 launch and now, and restated Sonnet 4.6 to 78.5%. Anthropic disclosed this in the launch post footnotes rather than burying it, which is more transparency than the eval field usually gets. The 81.2% Sonnet 5 number is on the new harness.

Eric He at Pace, an insurance-workflow vendor, frames the practical version of the result: Sonnet 5 [consistently takes the right action and does it quickly](https://www.anthropic.com/news/claude-sonnet-5) on submission intake, FNOL, and loss-run workflows, exactly the category of multi-step browser-and-app work that OSWorld-Verified measures.

## 5. Knowledge work: GDPval-AA v2

GDPval-AA v2 is the benchmark where Anthropic's framing of Sonnet 5 as "close to Opus 4.8" gets turned upside down.

```html-render

GDPval-AA v2
Aggregate score (higher is better, scale to 2000)

Sonnet 5: 1,618
Opus 4.8: 1,615

Source: Claude Sonnet 5 System Card, Anthropic, June 30, 2026. GDPval-AA v2 measures economically valuable knowledge work across professional domains.

```

Sonnet 5 scores 1,618. Opus 4.8 scores 1,615. Three points apart, with Sonnet 5 ahead.

The delta is small enough to be inside the noise band, but Anthropic's framing in the launch blog makes it clear the lab sees this as Sonnet 5 landing at parity on knowledge-work tasks. [The Decoder's coverage](https://the-decoder.com/anthropics-new-claude-sonnet-5-closes-the-gap-to-the-pricier-opus-model-series/) leads with exactly this finding: Sonnet 5 "even edges past Opus 4.8" on GDPval-AA v2.

For workflows where the bulk of the work is producing professional output (research memos, analysis, structured documents), Sonnet 5 at Sonnet pricing is now the answer, not Opus. Mauricio Wulfovich at Eve (plaintiff-law AI) reports [the largest practical gains from Sonnet 5 are in legal research and analysis](https://www.anthropic.com/news/claude-sonnet-5), at a price-to-performance ratio that made the migration call obvious. Wulfovich's quote lands where GDPval-AA v2 measures: knowledge work, not coding or computer use.

## 6. Agentic search: BrowseComp

Anthropic's launch blog leads with two cost-performance curves: agentic search (BrowseComp) and agentic computer use (OSWorld-Verified). Both plots compare Sonnet 5 against Sonnet 4.6 and Opus 4.8 at multiple effort levels, with cost per task on the x-axis and accuracy on the y-axis.

The plots' central claim is that Sonnet 5 and Opus 4.8 now cover a single range: Sonnet 5 is cheaper per task at any given accuracy, and Opus 4.8 is more accurate at any given spend. Between them, users adjust the effort dial to find the right balance. Sonnet 4.6 sat well below this range. Sonnet 5 moves into it.

Anthropic's framing of effort is worth reading carefully: Sonnet 5's reported scores are at default effort. The cost-performance curves show Sonnet 5 reaching Opus 4.8 accuracy on BrowseComp at the highest effort level, at roughly one-third the per-task token cost. For agentic search workflows where every task burns tokens, that's the structural shift, not any single benchmark number.

## The Sonnet-Opus gap is mostly gone

Six benchmarks. Five of them show Sonnet 5 within a few points of Opus 4.8, and on two (Terminal-Bench 2.1 and GDPval-AA v2) it actually scores higher. The model Anthropic positions as the cheaper alternative to Opus is, on Anthropic's own evaluations, often the better one for the work most teams actually run.

This is the structural change the launch blog is selling without quite saying it. The Sonnet-versus-Opus decision used to be capability versus cost. Sonnet 5 makes it accuracy-versus-cost at the margin, with the effort dial doing the rest.

For most agentic workloads (coding, terminal, computer use, knowledge work), Sonnet 5 is the model Anthropic is recommending you actually use. Opus 4.8 stays the choice for tasks where the last few accuracy points matter enough to pay the 67% input-price premium and the full xhigh-reasoning-token bill.

The launch-day reaction on X landed on that pricing math from the other direction. The pushback wasn't that Sonnet 5 is bad. It was on the cost.

```html-render

0xSero (@0xSero) · Jun 30, 2026, replying to @claudeai

Why would I pay more for less?

74.4K views

```

The reaction lands harder than it looks.

The headline pricing math goes like this: Sonnet 5's introductory rate is $2 per million input tokens and $10 per million output tokens, running through August 31, 2026. After that, Sonnet 5's standard rate is $3/$15 per MTok, identical to Sonnet 4.6's standard rate since February 2026. Opus 4.8 sits at $5/$25 per MTok, identical to Opus 4.7. On a per-token basis Sonnet 5 standard is 40% cheaper on input and 40% cheaper on output than Opus 4.8 standard.

The 0xSero critique lands because the per-token math doesn't translate directly to absolute dollars per task. Two things sit underneath it.

First, Sonnet 5 uses an updated tokenizer that maps the same input to 1.0–1.35× more tokens depending on content type. At standard rates, Sonnet 5 costs *more* per character of input than Sonnet 4.6 did, not less. The introductory $2/$10 rate is set so the tokenizer change is roughly cost-neutral against Sonnet 4.6 for the introductory window, which runs through August 31.

Second, the cost-performance curves Anthropic published put Sonnet 5 and Opus 4.8 on the same range, with Sonnet 5 cheaper at any given accuracy at low and medium effort. At xhigh effort, Sonnet 5 reaches Opus 4.8 accuracy but burns substantially more output tokens for reasoning. MarkTechPost's launch-day coverage: "Best value at low/medium effort; at xhigh it can cost more than Opus 4.8 for similar quality."

Anthropic shipped Sonnet 5 to win the middle of the curve, not the ends.

The open-source and local-model wave has spent the last year closing the same capability gaps on its own terms, on consumer hardware, at zero per-token cost. Sonnet 5's lead on Terminal-Bench 2.1 and GDPval-AA v2 lands in the same quarter as local Qwen, Mistral, and Llama-tier models running comparable agentic benchmarks from a laptop. The Sonnet-Opus decision is no longer a two-way choice for a meaningful slice of the workflows an assistant actually runs.
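The pricing arithmetic in this section can be checked with a short sketch. The per-MTok rates and the 1.0–1.35× tokenizer range are the figures quoted in this post; the dictionary layout, model keys, and function names are illustrative assumptions, not an Anthropic API, and the comparison covers input tokens only.

```python
# Per-MTok prices (USD) as quoted in this post. The keys and structure are
# hypothetical labels for illustration, not official model identifiers.
PRICES = {
    "sonnet-4.6": {"input": 3.0, "output": 15.0},
    "sonnet-5-intro": {"input": 2.0, "output": 10.0},
    "sonnet-5-standard": {"input": 3.0, "output": 15.0},
    "opus-4.8": {"input": 5.0, "output": 25.0},
}

# Tokenizer inflation range quoted in the post: the same text maps to
# 1.0x-1.35x as many tokens under the Sonnet 5 tokenizer.
TOKEN_MULTIPLIERS = (1.0, 1.35)


def discount_pct(cheaper: float, pricier: float) -> float:
    """Percent saved per token by paying `cheaper` instead of `pricier`."""
    return round((1 - cheaper / pricier) * 100, 1)


def premium_pct(pricier: float, cheaper: float) -> float:
    """Percent extra paid per token for `pricier` over `cheaper`."""
    return round((pricier / cheaper - 1) * 100, 1)


def relative_input_cost(model: str, multiplier: float,
                        baseline: str = "sonnet-4.6") -> float:
    """Input cost of the same text relative to the baseline model,
    after scaling the token count by the tokenizer multiplier."""
    rate = PRICES[model]["input"]
    base = PRICES[baseline]["input"]
    return round(rate * multiplier / base, 2)


if __name__ == "__main__":
    s5, opus = PRICES["sonnet-5-standard"], PRICES["opus-4.8"]
    print("Sonnet 5 vs Opus 4.8 input discount:",
          discount_pct(s5["input"], opus["input"]), "%")
    print("Opus 4.8 input premium over Sonnet 5:",
          premium_pct(opus["input"], s5["input"]), "%")
    for model in ("sonnet-5-intro", "sonnet-5-standard"):
        low, high = (relative_input_cost(model, m) for m in TOKEN_MULTIPLIERS)
        print(f"{model}: {low}x to {high}x the Sonnet 4.6 input cost")
```

Under these assumptions, the 40% discount and the 67% premium are the same $3-versus-$5 gap measured from opposite ends. For identical input text, the sketch puts Sonnet 5 at 0.67× to 0.9× the Sonnet 4.6 input cost on introductory pricing, and at 1.0× to 1.35× on standard pricing. Actual per-task spend also depends on output and reasoning tokens, which this sketch does not model.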
## Cyber capability by design

The one dimension Sonnet 5 is explicitly not optimized for is cybersecurity capability. Sonnet 5 was not trained on cybersecurity tasks. On the Firefox 147 exploit-development evaluation developed with Mozilla, both Sonnet 5 and Sonnet 4.6 score 0.0% on working exploits, with Sonnet 5 at 13.2% partial control versus Sonnet 4.6's lower partial rate. Opus 4.8 and Mythos 5 are substantially more capable on the same eval.

Anthropic launched Sonnet 5 with cyber safeguards enabled by default, matching Opus 4.7 and 4.8. The framing in the launch blog makes the positioning explicit: Sonnet 5 is the agentic workhorse, Opus 4.8 is the option for cybersecurity work that requires reduced guardrails.

The US government's block on Mythos 5 and Fable 5 over cybersecurity concerns hangs over the launch context. Sonnet 5 is, in part, the answer to that concern.

## The tokenizer trade

One footnote that doesn't fit anywhere on the benchmark charts: Sonnet 5 uses an updated tokenizer (the same kind of change Anthropic introduced with Opus 4.7). The same input maps to roughly 1.0–1.35× more tokens depending on content type.

Anthropic set the introductory $2/$10 pricing so the transition is roughly cost-neutral for a Sonnet 4.6 user, even with the token-density shift. After August 31, 2026, pricing moves to $3/$15 per million input/output, Sonnet 4.6's standard rates.

Worth knowing for any team running token-budget forecasts: at that point, Sonnet 5 costs *more* per character of input than Sonnet 4.6 does today, not less, because the per-token rate is identical and the token count per character is higher. The agentic gains are real, but they aren't free.

## Takeaways

- Sonnet 5 beats Sonnet 4.6 across every benchmark Anthropic published. The largest jump is Terminal-Bench 2.1 (+13.4 points on the same harness), the smallest is OSWorld-Verified (+2.7 points).
- Sonnet 5 lands within a few points of Opus 4.8 on five of six benchmarks, and ahead of Opus 4.8 on two: Terminal-Bench 2.1 (80.4 vs 74.6) and GDPval-AA v2 (1,618 vs 1,615).
- The Sonnet-versus-Opus decision is now accuracy-versus-cost at the margin, with the effort dial as the lever. Sonnet 5 reaches Opus 4.8 accuracy on BrowseComp at the highest effort setting, at roughly one-third the per-task token cost.
- Cursor's CursorBench (57% vs Sonnet 4.6's 49%) and the partner quotes land where the system card numbers do: Sonnet 5 is meaningfully better at agentic coding and brownfield debugging, not just at answering trivia.
- Cyber capability is the one area where Sonnet 5 is explicitly held back. The Firefox 147 eval puts Sonnet 5 at 0% working exploit, 13.2% partial. Opus 4.8 and Mythos 5 are the models for cyber work.
- Sonnet 5 ships with an updated tokenizer that increases tokens-per-input by 1.0–1.35×. Introductory pricing through August 31 ($2/$10 per MTok) is roughly cost-neutral for migrating Sonnet 4.6 users. After that, standard Sonnet rates ($3/$15) apply, identical to Sonnet 4.6's standard rates since February. The tokenizer change means Sonnet 5 costs more per character of input than Sonnet 4.6 does today, not less, once the intro period ends.
- On per-token pricing, Sonnet 5 standard is 40% cheaper on input and 40% cheaper on output than Opus 4.8 ($5/$25). At low and medium effort, that translates cleanly to cheaper absolute spend. At xhigh effort, Sonnet 5 burns substantially more output tokens for reasoning, and per Anthropic's own framing plus MarkTechPost's launch-day coverage, can cost more than Opus 4.8 for similar quality.

You can run Sonnet 5 as the model behind your Vellum assistant across Mac, iOS, web, Slack, and Telegram. The Sonnet 5 launch lands directly on the workflows an assistant is supposed to handle (multi-step coding, computer use, knowledge work), at the price point where running it as a daily surface stops feeling like a budget decision. The Pro plan ships with custom LLM credentials, so Sonnet 5's agentic gains and the Sonnet-Opus cost-performance dial carry over to your assistant without you having to manage API keys or pick a harness.

[Hatch your assistant →](https://vellum.ai)

## Resources

- [Introducing Claude Sonnet 5](https://www.anthropic.com/news/claude-sonnet-5): Anthropic launch blog, June 30, 2026. Primary source for the SWE-bench Pro, Terminal-Bench 2.1, HLE, OSWorld-Verified, and GDPval-AA v2 numbers used here.
- [Claude Sonnet 5 System Card](https://www.anthropic.com/claude-sonnet-5-system-card): Anthropic, June 30, 2026. Source for the Firefox 147 exploit-development evaluation (Sonnet 5 at 0% working exploit, 13.2% partial control) and the cyber capability positioning.
- [Anthropic Claude Sonnet 5 vs Sonnet 4.6 vs Opus 4.8: Agentic Coding Benchmarks, API Pricing, and Cost-Performance Tradeoffs Compared](https://www.marktechpost.com/2026/06/30/anthropic-claude-sonnet-5-vs-sonnet-4-6-vs-opus-4-8-agentic-coding-benchmarks-api-pricing-and-cost-performance-tradeoffs-compared/): MarkTechPost, June 30, 2026. Source for the cost-performance framing at low, medium, high, and xhigh effort, including the line that "best value at low/medium effort; at xhigh it can cost more than Opus 4.8 for similar quality."
- [Claude Sonnet 5 launch coverage](https://the-decoder.com/anthropic-launches-claude-sonnet-5/): The Decoder, June 30, 2026. Independent confirmation of the Sonnet-vs-Opus positioning and the Sonnet 5 terminal-benchmark lead.
- [Anthropic pricing](https://www.anthropic.com/pricing): Current Sonnet 5, Sonnet 4.6, and Opus 4.8 per-token rates ($2/$10 intro through Aug 31, then $3/$15 standard for Sonnet 5; $3/$15 for Sonnet 4.6; $5/$25 for Opus 4.8).
- [Cursor CursorBench](https://cursor.com/blog/cursorbench): Cursor's coding-preference benchmark (Sonnet 5 at 57% vs Sonnet 4.6 at 49%). The source for the partner quote on Sonnet 5's coding preference lead over Sonnet 4.6.