
Claude Opus 4.7 Benchmarks Explained

Anthropic dropped Claude Opus 4.7 today, and the benchmark table tells a focused story. This is not a model that sweeps every leaderboard. Anthropic is explicit that Claude Mythos Preview remains more broadly capable. But for developers building production coding agents and long-running workflows, the improvements are real and well-targeted.

Opus 4.7 is available now across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing stays the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens.

Here's what the numbers actually show.

Key observations of reported benchmarks

Benchmarks are imperfect, but they're the best tool we have for measuring progress. Based on Anthropic's official system card, a few things stand out:

Coding is the clear headline. SWE-bench Verified jumps from 80.8% to 87.6%, a nearly 7-point gain that puts Opus 4.7 ahead of Gemini 3.1 Pro (80.6%). On SWE-bench Pro, the harder multi-language variant, Opus 4.7 goes from 53.4% to 64.3%, leapfrogging both GPT-5.4 (57.7%) and Gemini (54.2%).

Computer use takes a meaningful step. OSWorld-Verified climbs from 72.7% to 78.0%, ahead of GPT-5.4 (75.0%) and within 1.6 points of Mythos Preview (79.6%). Pair that with the 3x vision resolution upgrade and you have a model that's genuinely more capable at real UI interaction.

Tool use is best-in-class. Opus 4.7 leads MCP-Atlas at 77.3%, ahead of Opus 4.6 (75.8%), GPT-5.4 (68.1%), and Gemini 3.1 Pro (73.9%). For teams building tool-calling agents, this is the number that matters most.

Reasoning is strong but not dominant. GPQA Diamond comes in at 94.2%, competitive with Gemini 3.1 Pro (94.3%) and GPT-5.4 Pro (94.4%), and a clear improvement on Opus 4.6 (91.3%). This benchmark is approaching saturation at the frontier, so the gain here matters less than the coding and tool-use improvements.

Agentic search is the one area that slipped. BrowseComp dropped from 83.7% to 79.3%, trailing Gemini 3.1 Pro (85.9%) and GPT-5.4 Pro (89.3%). Worth knowing if your agents rely heavily on web research.

Coding capabilities

Coding benchmarks measure a model's ability to understand, generate, and fix real code across production codebases. SWE-bench tests real-world GitHub issue resolution; SWE-bench Pro raises the bar with multi-language tasks; Terminal-Bench tests command-line proficiency.

SWE-bench Verified

SWE-bench Verified is 500 human-validated GitHub issues that the model must resolve end-to-end. It's the standard benchmark for agentic software engineering.

Model | SWE-bench Verified
Claude Opus 4.7 | 87.6%
Claude Mythos Preview | 93.9%
Claude Opus 4.6 | 80.8%
Gemini 3.1 Pro | 80.6%

A 6.8-point gain over Opus 4.6 puts Opus 4.7 in a clear lead among generally available models. Early-access partners confirmed this in their own internal evals: Cursor reported a jump from 58% to 70% on CursorBench, and one partner saw a 13% higher resolution rate on a 93-task coding benchmark, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve at all.

SWE-bench Pro

SWE-bench Pro tests the full engineering pipeline across four programming languages. It's harder and more industrially relevant than the Verified version.

Model | SWE-bench Pro
Claude Mythos Preview | 77.8%
Claude Opus 4.7 | 64.3%
GPT-5.4 | 57.7%
Gemini 3.1 Pro | 54.2%
Claude Opus 4.6 | 53.4%

A 10.9-point improvement from 53.4% to 64.3% puts Opus 4.7 meaningfully ahead of every currently available competitor on this benchmark.

Terminal-Bench 2.0

Terminal-Bench tests command-line proficiency: navigating shells, executing devops tasks, and debugging in terminal environments.

Model | Terminal-Bench 2.0
Claude Mythos Preview | 82.0%
GPT-5.4 | 75.1%*
Claude Opus 4.7 | 69.4%
Gemini 3.1 Pro | 68.5%
Claude Opus 4.6 | 65.4%

*GPT-5.4 score uses a self-reported harness and is not directly comparable.

Opus 4.7 adds 4 points over Opus 4.6 and edges ahead of Gemini. Early-access partner Warp confirmed it passed Terminal-Bench tasks that previous Claude models had failed, including a concurrency bug Opus 4.6 couldn't crack.

Agentic capabilities

Beyond raw coding, Anthropic invested heavily in how Opus 4.7 performs across multi-step tool-calling workflows. This is the category that determines whether an agent can actually run unsupervised.

MCP-Atlas (Scaled tool use)

MCP-Atlas measures performance across complex, multi-turn tool-calling scenarios. It's the closest thing to a real production agent benchmark.

Model | MCP-Atlas
Claude Opus 4.7 | 77.3%
Claude Opus 4.6 | 75.8%
Gemini 3.1 Pro | 73.9%
GPT-5.4 | 68.1%

This is Opus 4.7's strongest competitive result in the full table. Best-in-class tool use is exactly what you want if you're building orchestration agents that route to multiple tools in a single workflow.
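
To make the "multiple tools in a single workflow" point concrete, here's a minimal sketch of what a multi-tool request looks like with the Anthropic Python SDK. The tool names and schemas below are hypothetical examples, not anything from Anthropic's documentation; only the overall tools/messages request shape follows the Messages API.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical tools for illustration only; the names and schemas are made up,
# but the `tools` parameter shape follows the Anthropic Messages API.
tools = [
    {
        "name": "search_tickets",
        "description": "Search the issue tracker for open tickets matching a query.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return failing tests.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]

response = client.messages.create(
    model="claude-opus-4-7",  # model name as listed in this post
    max_tokens=2048,
    tools=tools,
    messages=[{"role": "user", "content": "Find the failing auth tests and summarize them."}],
)

# Tool calls come back as tool_use content blocks that your agent loop executes.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```

In a real orchestration agent, the loop would execute each tool_use block, append the corresponding tool_result messages, and call the API again until the model stops requesting tools.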

Finance Agent v1.1

Finance Agent measures multi-step financial analysis: planning, tool use, and coherent output across tasks like building financial models and professional presentations.

Model | Finance Agent v1.1
Claude Opus 4.7 | 64.4%
GPT-5.4 Pro | 61.5%
Claude Opus 4.6 | 60.1%
Gemini 3.1 Pro | 59.7%

Opus 4.7 leads this benchmark. Anthropic also calls out state-of-the-art performance on GDPval-AA, a third-party evaluation of knowledge work across finance, legal, and professional domains.

OSWorld-Verified (Computer use)

OSWorld-Verified tests autonomous interaction with desktop software: clicking, navigating, and completing tasks in real GUI environments.

Model | OSWorld-Verified
Claude Mythos Preview | 79.6%
Claude Opus 4.7 | 78.0%
GPT-5.4 | 75.0%
Claude Opus 4.6 | 72.7%

A 5.3-point gain over Opus 4.6, within 1.6 points of Mythos Preview. Combined with the 3x vision resolution upgrade, this is a meaningful unlock for computer-use agents that need to read dense UIs or interpret screenshots.

BrowseComp (Agentic search)

BrowseComp measures performance on multi-step web research tasks where the model has to browse, synthesize, and reason across multiple pages.

Model | BrowseComp
GPT-5.4 Pro | 89.3%
Claude Mythos Preview | 86.9%
Gemini 3.1 Pro | 85.9%
Claude Opus 4.6 | 83.7%
Claude Opus 4.7 | 79.3%

This is the one clear regression in the table. Opus 4.7 dropped 4.4 points from Opus 4.6 on BrowseComp. If your agent workload is heavy on research and web browsing, this is worth factoring in.

Reasoning capabilities

GPQA Diamond (Graduate-level science)

GPQA Diamond tests PhD-level reasoning across physics, chemistry, and biology. This benchmark is approaching saturation — all frontier models are clustered between 91% and 95%.

Model | GPQA Diamond
Claude Mythos Preview | 94.6%
GPT-5.4 Pro | 94.4%
Gemini 3.1 Pro | 94.3%
Claude Opus 4.7 | 94.2%
Claude Opus 4.6 | 91.3%

A 2.9-point improvement puts Opus 4.7 right in the cluster of frontier leaders. There's no meaningful gap between the top four models here. The real differentiation is in the agentic and coding benchmarks above.

Humanity's Last Exam

HLE tests reasoning at the frontier of human knowledge. It's the hardest multi-modal benchmark currently in use.

Model | HLE (no tools) | HLE (with tools)
Claude Mythos Preview | 56.8% | 64.7%
GPT-5.4 Pro | 42.7% | 58.7%
Claude Opus 4.7 | 46.9% | 54.7%
Gemini 3.1 Pro | 44.4% | 51.4%
Claude Opus 4.6 | 40.0% | 53.3%

Opus 4.7 improves on Opus 4.6 in both configurations. With tools it beats Gemini 3.1 Pro (51.4%) and closes toward GPT-5.4 Pro (58.7%). The gap to Mythos Preview (64.7% with tools) is real, but Opus 4.7 is a clear step up from its predecessor.

Multimodal and vision capabilities

CharXiv Reasoning (Visual reasoning)

CharXiv tests scientific figure interpretation: reading and reasoning about charts, graphs, and complex data visualizations.

Model | CharXiv (no tools) | CharXiv (with tools)
Claude Mythos Preview | 86.1% | 93.2%
Claude Opus 4.7 | 82.1% | 91.0%
Claude Opus 4.6 | 69.1% | 84.7%

A 13-point jump without tools and a 6-point jump with tools. This is the largest single-benchmark improvement in the full table. It maps directly to the underlying model change: Opus 4.7 now accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), more than 3x the resolution of prior Claude models.

One early-access partner testing computer vision for autonomous penetration testing saw visual acuity jump from 54.5% (Opus 4.6) to 98.5%. For any agent that reads dense screenshots, technical diagrams, or data-rich interfaces, this is worth testing immediately.
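
If you preprocess screenshots before sending them, the new ceiling is worth encoding in your pipeline. Here's a minimal sketch, assuming client-side downscaling with Pillow and using the 2,576-pixel long-edge figure quoted above; the API may also downscale oversized images itself, so treat this as optional preprocessing rather than a requirement.

```python
from PIL import Image

MAX_LONG_EDGE = 2576  # long-edge limit cited in this post for Opus 4.7

def fit_to_long_edge(path: str, out_path: str) -> None:
    """Downscale an image so its longer side is at most MAX_LONG_EDGE pixels."""
    img = Image.open(path)
    long_edge = max(img.size)
    if long_edge > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / long_edge
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    img.save(out_path)

fit_to_long_edge("dashboard_screenshot.png", "dashboard_screenshot_scaled.png")
```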

Multilingual Q&A (MMMLU)

Model | MMMLU
Gemini 3.1 Pro | 92.6%
Claude Opus 4.7 | 91.5%
Claude Opus 4.6 | 91.1%

An incremental improvement over Opus 4.6. Gemini 3.1 Pro leads this benchmark, which is worth weighing if multilingual performance is your primary requirement.

Safety and alignment

Opus 4.7 ships with something no prior Opus model has included: production cybersecurity safeguards, tested here before eventually rolling out to Mythos-class models.

Anthropic's automated behavioral audit shows Opus 4.7 as a modest improvement over Opus 4.6 on overall misaligned behavior. It improves on honesty and resistance to prompt injection attacks, with a small regression on overly detailed harm-reduction advice for controlled substances. Their assessment: "largely well-aligned and trustworthy, though not fully ideal in its behavior."

Claude Mythos Preview remains the best-aligned model Anthropic has trained. Opus 4.7 is explicitly the bridge model where real-world safety mechanisms get tested before broader rollout.

Security professionals doing legitimate cybersecurity work (pen testing, vulnerability research, red-teaming) can apply to Anthropic's new Cyber Verification Program.

What these benchmarks really mean for your agents

The picture from Opus 4.7's benchmark table is clear: Anthropic shipped a focused upgrade, not a broad sweep.

The strongest results are in the places that break production agents. SWE-bench Pro jumping from 53.4% to 64.3% means Opus 4.7 can handle the harder multi-language engineering tasks that Opus 4.6 regularly stumbled on. MCP-Atlas at 77.3%, best-in-class, means it's the top option for multi-tool orchestration workflows. The CharXiv visual reasoning jump from 69.1% to 82.1% without tools, combined with 3x resolution support, means computer-use agents that depend on reading UIs are in materially better shape.

The instruction-following improvements that partners reported (fewer tool errors, better follow-through on multi-step tasks, loop resistance) don't show up as a single number in the table, but they show up consistently across partner evaluations. An agent that carries tasks through to completion instead of stopping halfway is qualitatively different from one that doesn't.

One honest note: BrowseComp dropped 4.4 points. If your agent workload is research-heavy, with lots of web browsing and synthesis across multiple pages, Opus 4.7 is a slight step back from Opus 4.6 on that specific task type. GPT-5.4 Pro (89.3%) and Gemini 3.1 Pro (85.9%) are the better options there.

For teams running coding agents, agentic financial workflows, or computer-use automation, this is a real upgrade. The benchmark evidence and the partner feedback say the same thing: Opus 4.7 is the version of Opus that's reliable enough to hand off the hardest work.

When to use Opus 4.6 vs Opus 4.7

The short answer: if you're on Opus 4.6 today and your agents do any meaningful coding, tool use, or visual reasoning, Opus 4.7 is worth upgrading to. But there are a few cases where Opus 4.6 still holds up or where the tradeoff matters.

Upgrade to Opus 4.7 if:

  • Your agents resolve real GitHub issues or work across multi-language codebases. The SWE-bench Pro jump from 53.4% to 64.3% is the biggest coding gain in this generation.
  • You're running multi-tool orchestration. MCP-Atlas at 77.3% leads every available model including GPT-5.4.
  • Your agents need to read screenshots, dashboards, or technical diagrams. The 3x resolution increase and 13-point CharXiv jump are a genuine capability unlock.
  • You're doing agentic financial analysis or professional knowledge work. Finance Agent v1.1 at 64.4% leads all models compared.
  • Instruction following and end-to-end task completion matter. Partners consistently reported fewer tool errors and better follow-through on complex workflows.

Stick with Opus 4.6 (or consider alternatives) if:

  • Your agents rely heavily on deep web research and multi-page synthesis. BrowseComp dropped 4.4 points. GPT-5.4 Pro (89.3%) or Gemini 3.1 Pro (85.9%) are better fits for that specific workload.
  • You have finely tuned prompts optimized for Opus 4.6 behavior. Anthropic notes that Opus 4.7's improved instruction following can cause prompts written for earlier models to produce unexpected results. Re-tuning is advised before switching.
  • Token usage is constrained. Opus 4.7 uses an updated tokenizer that maps the same input to more tokens (roughly 1.0 to 1.35x depending on content type), and it thinks more at higher effort levels. Test token impact on real traffic before migrating production workloads.
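
One way to quantify that last point before migrating: run prompts representative of your real traffic through the token-counting endpoint for both models and compare. Here's a minimal sketch with the Anthropic Python SDK; the Opus 4.6 model ID below is a placeholder, since this post only names claude-opus-4-7.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Use a prompt that's representative of your real traffic.
messages = [{"role": "user", "content": "Review this 400-line diff and list the risky changes: ..."}]

# "claude-opus-4-7" is the model name given in this post; the Opus 4.6 ID below is a
# placeholder -- substitute whatever model ID you run in production today.
for model in ("claude-opus-4-6", "claude-opus-4-7"):
    count = client.messages.count_tokens(model=model, messages=messages)
    print(f"{model}: {count.input_tokens} input tokens")
```

Running this over a sample of real prompts gives you the actual multiplier for your traffic rather than the 1.0 to 1.35x range quoted for content types in general.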

Frequently asked questions

Is Claude Opus 4.7 better than GPT-5.4?

It depends on the task. Opus 4.7 leads GPT-5.4 on SWE-bench Pro (64.3% vs 57.7%) and MCP-Atlas tool use (77.3% vs 68.1%), and posts 87.6% on SWE-bench Verified, where GPT-5.4 has no published score to compare against. GPT-5.4 Pro leads on BrowseComp (89.3% vs 79.3%) and Humanity's Last Exam with tools (58.7% vs 54.7%). For coding and agentic tool use, Opus 4.7 is the stronger choice. For research-heavy workflows, GPT-5.4 Pro has an edge.

Is Claude Opus 4.7 better than Gemini 3.1 Pro?

Generally yes for coding and tool use. Opus 4.7 leads on SWE-bench Pro (64.3% vs 54.2%), SWE-bench Verified (87.6% vs 80.6%), MCP-Atlas (77.3% vs 73.9%), Finance Agent (64.4% vs 59.7%), and Humanity's Last Exam with tools (54.7% vs 51.4%). Gemini leads on BrowseComp (85.9% vs 79.3%) and MMMLU multilingual Q&A (92.6% vs 91.5%).

How much does Claude Opus 4.7 cost?

The pricing is unchanged from Opus 4.6: $5 per million input tokens and $25 per million output tokens. Note that Opus 4.7 uses an updated tokenizer that can increase token counts by roughly 1.0 to 1.35x depending on content type, so your actual spend per task may increase slightly. Anthropic recommends measuring the difference on real traffic.
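
Here's a quick back-of-envelope on what that multiplier means in dollars, using the prices above and a hypothetical task size chosen purely for illustration:

```python
# Back-of-envelope cost impact of the tokenizer change, using the prices in this post.
INPUT_PRICE_PER_M = 5.00    # USD per million input tokens
OUTPUT_PRICE_PER_M = 25.00  # USD per million output tokens

def task_cost(input_tokens: float, output_tokens: float) -> float:
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M

# Hypothetical task size, for illustration only: 20k input and 4k output tokens on Opus 4.6.
base_in, base_out = 20_000, 4_000
for multiplier in (1.0, 1.15, 1.35):  # the 1.0-1.35x range cited above
    cost = task_cost(base_in * multiplier, base_out * multiplier)
    print(f"{multiplier:.2f}x tokens -> ${cost:.3f} per task")
```

At the top of the range, a $0.20 task becomes a $0.27 task, so the increase is real but modest; measuring on your own traffic is still the right call.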

What is the context window for Claude Opus 4.7?

Based on Anthropic's documentation, Opus 4.7 maintains the same context window as Opus 4.6. Check Anthropic's API docs for the latest limits.

What's new in Claude Opus 4.7 besides the benchmark improvements?

Several things. Vision resolution increased to 2,576 pixels on the long edge (~3.75 megapixels), more than 3x prior Claude models. A new xhigh effort level was added between high and max, giving finer control over the reasoning/latency tradeoff. Task budgets are available in public beta on the API, letting developers guide token spend across longer runs. In Claude Code, a new /ultrareview command runs a dedicated review session that flags bugs and design issues. Auto mode was extended to Max users, allowing longer runs with fewer interruptions.

What is the xhigh effort level?

It's a new effort setting between high and max in Opus 4.7. It gives you more granular control over the tradeoff between depth of reasoning and response latency on hard problems. For coding and agentic use cases, Anthropic recommends starting at high or xhigh. Claude Code now defaults to xhigh for all plans.
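
This post doesn't show the request shape for effort levels, so the snippet below is only a guess at how you might pass one through the Anthropic Python SDK's extra_body escape hatch. The field name and allowed values are assumptions; check Anthropic's API docs for the real parameter before using this.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# NOTE: the "effort" field below is a placeholder for illustration only. This post
# doesn't specify the request shape, so confirm the actual parameter name and
# allowed values in Anthropic's API docs before relying on it.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    extra_body={"effort": "xhigh"},  # hypothetical; high/xhigh recommended for coding per this post
    messages=[{"role": "user", "content": "Refactor this module and explain the tradeoffs."}],
)
print(response.content[0].text)
```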

What benchmarks does Opus 4.7 lead?

Opus 4.7 leads among currently available (non-preview) models on: SWE-bench Verified (87.6%), SWE-bench Pro (64.3%), MCP-Atlas scaled tool use (77.3%), OSWorld-Verified computer use (78.0%), Finance Agent v1.1 financial analysis (64.4%), and CharXiv visual reasoning (82.1% without tools, 91.0% with tools). Mythos Preview leads most of these categories but is a separate, more restricted model.

Does Claude Opus 4.7 support computer use?

Yes. OSWorld-Verified comes in at 78.0%, a 5.3-point improvement over Opus 4.6 (72.7%) and ahead of GPT-5.4 (75.0%). Combined with the higher resolution vision upgrade, computer-use agents that depend on reading dense UIs or interpreting screenshots will see a meaningful capability improvement.

Should I re-tune my prompts when migrating from Opus 4.6?

Anthropic specifically recommends this. Opus 4.7's improved instruction following means it takes instructions more literally than Opus 4.6 did. Prompts that relied on the older model's loose interpretation or tendency to skip certain instructions may produce different results. Test on representative traffic before switching production workloads.

Is Claude Opus 4.7 the most powerful Claude model?

No. Claude Mythos Preview is Anthropic's most capable model and leads Opus 4.7 on most benchmarks in the comparison table, including SWE-bench Pro (77.8% vs 64.3%), SWE-bench Verified (93.9% vs 87.6%), Terminal-Bench (82.0% vs 69.4%), and GPQA Diamond (94.6% vs 94.2%). Opus 4.7 is the most capable generally available Claude model and is the first to ship with the new cybersecurity safeguards Anthropic is developing ahead of a broader Mythos-class rollout.

Where can I access Claude Opus 4.7?

Opus 4.7 is available through the Anthropic API (model name: claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. It's also available directly in Vellum, either by plugging in your Anthropic API key in Settings under Models & Services, or by logging into a Vellum account and selecting it from the model picker.

Extra resources

Claude Opus 4.6 vs 4.5 benchmarks explained →

Claude Opus 4.5 benchmarks explained →

GPT-5.2 benchmarks explained →

Google Gemini 3 benchmarks explained →

Everything you need to know about Claude Mythos →