
Claude Opus 4.6 Benchmarks

Explore this breakdown of Claude Opus 4.6 and how it stacks up against Opus 4.5 and the latest OpenAI and Google models.

10 min read
Written by Nicolas Zeeb
Reviewed by David Vargas

The AI space is buzzing about Anthropic's latest release: Claude Opus 4.6. If you know Claude, you know the hype is real.

This upgrade brings meaningful improvements across agentic workflows and reasoning tasks while showing some unexpected trade-offs in certain benchmarks. What makes these powerful upgrades even more exciting is that Opus 4.6 is the first Opus-class model with a 1M token context window. Together, these gains translate into agents that can operate over far larger problems without losing context.

The model is available now through Anthropic's API, major cloud providers, and best of all on Vellum!
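If you want to kick the tires via the API, a minimal sketch with the Anthropic Python SDK looks like this. The model identifier below is an assumption for illustration, so check Anthropic's model list for the exact string.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # hypothetical model id, confirm against Anthropic's docs
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Outline a plan for migrating a monolith to services."}
    ],
)
print(response.content[0].text)
```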

💡 Want to see how Claude Opus 4.6 compares to the other leading models for your use case? Compare them in Vellum!

Key observations from benchmarks

While benchmarks are inherently limited and may not fully capture real-world utility, they are our only quantifiable way to measure progress. From the reported data, we can conclude a few things:

  • Agentic capabilities shine: The standout results are in agentic tasks. 65.4% on Terminal-Bench 2.0, 72.7% on OSWorld (computer use), 91.9% on τ2-bench Retail, and a massive 84.0% on BrowseComp search. These represent significant leaps over Opus 4.5 and competing models in practical agent workflows.
  • Novel problem-solving dominance: The 68.8% score on ARC AGI 2 nearly doubles Opus 4.5's 37.6% and crushes Gemini 3 Pro's 45.1%, indicating a major step forward in abstract reasoning capability.
  • Multidisciplinary reasoning leads without tools: 40.0% on Humanity's Last Exam (without tools) beats Opus 4.5's 30.8% and Gemini 3 Pro's 37.5%, though GPT-5.2 Pro still holds the crown at 50.0%.
  • Coding trade-off: Interestingly, Opus 4.6 scores 80.8% on SWE-bench Verified, a slight dip from Opus 4.5's 80.9%, suggesting optimization focused elsewhere.
  • Visual reasoning improvements: 73.9% without tools and 77.3% with tools on MMMU Pro shows steady progress, though still trailing GPT-5.2's 79.5%/80.4%.

Coding and Software Engineering

Agentic terminal coding (Terminal-Bench 2.0)

Terminal-Bench evaluates a model's ability to navigate command-line environments, execute shell commands, and perform development operations.

Claude Opus 4.6 scores 65.4% on Terminal-Bench 2.0, a substantial improvement over Opus 4.5's 59.8% and well ahead of Sonnet 4.5's 51.0% and Gemini 3 Pro's 56.2%. It also edges past GPT-5.2's 64.7% (self-reported via Codex CLI). This represents the strongest performance in Anthropic's lineup for command-line proficiency.
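To make "agentic terminal coding" concrete, here is a minimal sketch of the harness pattern these benchmarks rely on: the model proposes a shell command, the harness executes it, and the output is fed back into the next turn. The `ask_model` callable is a stand-in for whatever client you use (for example the Anthropic SDK), and running model-generated commands like this is only safe inside a sandbox.

```python
import subprocess

def run_terminal_agent(task: str, ask_model, max_steps: int = 10) -> str:
    """Loop: ask the model for the next shell command, run it, append the output."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        command = ask_model(transcript)  # expected to return a shell command, or "DONE"
        if command.strip() == "DONE":
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript += f"$ {command}\n{result.stdout}{result.stderr}\n"
    return transcript
```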

Agentic coding (SWE-bench Verified)

SWE-bench Verified tests real-world software engineering by evaluating models on their ability to resolve actual GitHub issues across production codebases.

Opus 4.6 achieves 80.8% on SWE-bench Verified, essentially matching Opus 4.5's 80.9% and GPT-5.2's 80.0%, while outperforming Sonnet 4.5's 77.2% and Gemini 3 Pro's 76.2%. This near-parity with its predecessor suggests Anthropic prioritized other capabilities in this iteration while maintaining elite coding performance.

Agentic Tool Use and Orchestration

Agentic tool use (τ2-bench)

The τ2-bench evaluates sophisticated tool-calling capabilities across two domains: Retail (consumer scenarios) and Telecom (enterprise support). This benchmark tests multi-step planning and accurate function invocation.

Opus 4.6 achieves remarkable scores: 91.9% on Retail and 99.3% on Telecom. The Retail score surpasses Opus 4.5's 88.9%, Sonnet 4.5's 86.2%, Gemini 3 Pro's 85.3%, and GPT-5.2's 82.0%.

On Telecom, its 99.3% edges out Opus 4.5's 98.2% and GPT-5.2's 98.7%, and sits ahead of Sonnet 4.5 and Gemini 3 Pro, both at 98.0%. These results position Opus 4.6 as the strongest model for complex tool orchestration.
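For reference, this is the style of function calling that τ2-bench exercises, sketched against the Anthropic Messages API. The tool name, schema, and model id here are invented for illustration and are not part of the benchmark.

```python
import anthropic

client = anthropic.Anthropic()

# One retail-style tool; the name and schema are illustrative only.
tools = [
    {
        "name": "lookup_order",
        "description": "Look up a customer order by ID and return its status and items.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    }
]

response = client.messages.create(
    model="claude-opus-4-6",  # hypothetical model id
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order A1234?"}],
)

# When the model decides to call a tool, it returns tool_use content blocks.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```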

Scaled tool use (MCP Atlas)

MCP Atlas tests a model's ability to handle tool use at scale, evaluating performance when coordinating many tools simultaneously. That makes it a useful signal when choosing a model for an agent that carries a large tool catalog.

Opus 4.6 scores 59.5% on MCP Atlas, falling behind Opus 4.5's 62.3% and GPT-5.2's 60.6%, but ahead of Sonnet 4.5's 43.8% and Gemini 3 Pro's 54.1%. This dip from the previous version is one of the few areas where Opus 4.6 regresses, suggesting potential trade-offs in how the model handles highly scaled tool coordination.

Computer and Environment Interaction

Agentic computer use (OSWorld)

OSWorld evaluates a model's ability to control computers through GUI interactions, simulating real desktop automation tasks.

Claude Opus 4.6 delivers 72.7% on OSWorld, a significant jump from Opus 4.5's 66.3% and well ahead of Sonnet 4.5's 61.4%. This benchmark wasn't reported for Gemini 3 Pro or GPT-5.2, making direct comparison impossible, but the 6.4 percentage point improvement over its predecessor is notable for practical automation workflows.

Agentic search (BrowseComp)

BrowseComp evaluates web browsing and search capabilities, testing a model's ability to navigate websites, extract information, and complete multi-step research tasks.

Opus 4.6 dominates with 84.0% on BrowseComp, a massive leap from Opus 4.5's 67.8% and crushing Sonnet 4.5's 43.9%. It also beats Gemini 3 Pro's 59.2% (Deep Research) and GPT-5.2 Pro's 77.9%. This 16.2 percentage point improvement over its predecessor makes Opus 4.6 the clear leader for agentic web research and information gathering.

Reasoning and General Intelligence

Multidisciplinary reasoning (Humanity's Last Exam)

Humanity's Last Exam tests frontier reasoning across diverse academic disciplines, designed to challenge even the most capable models with questions requiring deep understanding and synthesis.

Opus 4.6 scores 40.0% without tools and 53.1% with tools, improving significantly over Opus 4.5's 30.8%/43.4% and Sonnet 4.5's 17.7%/33.6%. It also edges out Gemini 3 Pro's 37.5%/45.8% and base GPT-5.2's 36.6%, though GPT-5.2 Pro still holds the crown on the without-tools comparison at 50.0%. The 9.2 percentage point gain without tools suggests meaningful improvements in core reasoning capacity.

Novel problem-solving (ARC AGI 2)

ARC AGI 2 tests abstract reasoning and pattern recognition on novel problems, designed to measure general intelligence rather than learned knowledge—one of the most challenging benchmarks for current AI systems.

Opus 4.6 scores an impressive 68.8% on ARC AGI 2, nearly doubling Opus 4.5's 37.6% and significantly outperforming Gemini 3 Pro's 45.1% (Deep Thinking) and GPT-5.2 Pro's 54.2%. This 31.2 percentage point leap represents one of the most dramatic improvements in the release and suggests a fundamental advancement in abstract reasoning capability.

Graduate-level reasoning (GPQA Diamond)

GPQA Diamond evaluates expert-level scientific knowledge across physics, chemistry, and biology with PhD-level questions, testing both domain expertise and reasoning depth.

Opus 4.6 achieves 91.3% on GPQA Diamond, improving over Opus 4.5's 87.0% and Sonnet 4.5's 83.4%, coming in just shy of Gemini 3 Pro's 91.9% and trailing GPT-5.2 Pro's 93.2%. While this benchmark is approaching saturation, the 4.3 percentage point gain confirms continued progress in scientific reasoning.

Long Context Capabilities

Long-context retrieval (MRCR v2, needle-in-a-haystack)

Large context windows only matter if a model can reliably retrieve the right information. MRCR v2 measures this by testing a model’s ability to find multiple specific facts buried deep within long inputs.

Opus 4.6 delivers strong long-context retrieval, scoring 93.0% at 256K and 76.0% at 1M context, far outperforming Sonnet 4.5 and demonstrating reliable recall even at extreme context lengths.

GPT-5.2 Thinking shows similarly strong retrieval, achieving 98% on the 4-needle test and 70% on the 8-needle test at 256K, and 85% mean match ratio at 128K. Gemini 3 Pro trails at 77% on the 8-needle benchmark. Together, these results show that Opus 4.6 and GPT-5.2 both pair large context windows with dependable retrieval, while Gemini’s performance degrades more noticeably as context scales.
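To ground what these retrieval numbers measure, here is a toy needle-in-a-haystack check in the spirit of MRCR, not the benchmark itself: a few facts are buried in filler text and the model's answer is checked for recall. The filler and "needles" are invented for illustration.

```python
import random

FILLER = "The committee reviewed routine agenda items without further discussion. "
NEEDLES = {
    "What is the project codename?": "The project codename is Bluebird.",
    "Which port does the staging server use?": "The staging server listens on port 8443.",
}

def build_haystack(total_sentences: int = 5000) -> str:
    """Bury each needle sentence at a random position inside repeated filler."""
    sentences = [FILLER] * total_sentences
    for needle in NEEDLES.values():
        sentences.insert(random.randrange(len(sentences)), needle + " ")
    return "".join(sentences)

def recalled(question: str, model_answer: str) -> bool:
    """Crude containment check that stands in for proper grading."""
    key_fact = NEEDLES[question].rstrip(". ").split()[-1]  # "Bluebird" or "8443"
    return key_fact.lower() in model_answer.lower()
```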

Multimodal and Visual Reasoning

Visual reasoning (MMMU Pro)

MMMU Pro tests multimodal understanding by requiring models to reason about complex visual information across academic disciplines, evaluating both perception and analytical capabilities.

Opus 4.6 scores 73.9% without tools and 77.3% with tools, improving from Opus 4.5's 70.6%/73.9% and Sonnet 4.5's 63.4%/68.9%. Gemini 3 Pro leads without tools at 81.0%, while GPT-5.2 tops the with-tools category at 80.4%. The gains here are steady but incremental compared to Opus 4.6's leaps in other areas.

Knowledge Work and Domain-Specific Intelligence

Office tasks (GDPVal-AA Elo)

GDPVal-AA measures performance on knowledge work tasks using an Elo rating system, evaluating ability to produce real work products like presentations, spreadsheets, and documents.

Opus 4.6 scores 1606 Elo, ahead of Opus 4.5's 1416, GPT-5.2's 1462, Sonnet 4.5's 1277, and Gemini 3 Pro's 1195. This 190-point improvement over its predecessor indicates significantly better performance on long-horizon professional tasks requiring planning, execution, and coherent output across multiple steps.
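Because Elo numbers read differently from percentages, here is a minimal sketch of how a leaderboard like this is typically built: each "match" is a pairwise comparison between two models' work products, and ratings move based on how surprising the outcome was. The K-factor and ratings below are illustrative only.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A 1606-rated model beating a 1462-rated one gains only about ten points,
# because that outcome was already expected.
print(elo_update(1606, 1462, a_won=True))
```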

Agentic financial analysis (Finance Agent)

The Finance Agent benchmark evaluates performance on realistic financial analysis tasks, including data interpretation, calculation, and financial reasoning.

Opus 4.6 achieves 60.7% on this benchmark, outperforming Opus 4.5's 55.9%, Sonnet 4.5's 54.2%, GPT-5.2's 56.6%, and Gemini 3 Pro's 44.1%. This best-in-class result suggests strong practical utility for financial services applications, quantitative analysis, and business intelligence tasks.

Multilingual Understanding

Multilingual Q&A (MMMLU)

MMMLU evaluates multilingual understanding and reasoning across languages, testing whether models maintain reasoning capability beyond English.

Opus 4.6 achieves 91.1% on MMMLU, roughly matching Opus 4.5's 90.8%, edging ahead of Sonnet 4.5's 89.5% and GPT-5.2's 89.6%, and trailing only Gemini 3 Pro's 91.8%. This near-parity across the Claude lineup suggests consistent multilingual capabilities across model sizes.

What's new and notable

Agent-focused

Opus 4.6's dramatic improvements in computer use (+6.4pp), web search (+16.2pp), and terminal operations (+5.6pp) signal that Anthropic optimized specifically for practical agent deployments. The 84.0% BrowseComp score makes this the go-to model for research agents and information retrieval tasks.

Massive leap in abstract reasoning

The 68.8% ARC AGI 2 score—nearly double the previous version—represents one of the largest single-benchmark improvements we've seen in a frontier model update. This isn't just benchmark optimization; it suggests genuine advances in novel problem-solving that should translate to better performance on tasks the model hasn't explicitly been trained for.

MCP Atlas regression

The drop from 62.3% to 59.5% on scaled tool use is one of the few areas where Opus 4.6 steps backward. For teams building agents that coordinate dozens of tools simultaneously, this trade-off matters and may require additional orchestration logic at the application layer.
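One common shape for that orchestration logic, sketched below with a deliberately naive heuristic: pre-filter the tool catalog down to the handful most relevant to the current request before each model call, rather than sending every definition every time.

```python
def select_tools(user_message: str, tool_catalog: list[dict], top_k: int = 10) -> list[dict]:
    """Keep the tools whose name/description overlaps most with the request (toy heuristic)."""
    words = set(user_message.lower().split())

    def score(tool: dict) -> int:
        text = (tool["name"] + " " + tool.get("description", "")).lower()
        return sum(1 for w in words if w in text)

    return sorted(tool_catalog, key=score, reverse=True)[:top_k]

# The filtered subset is what gets passed as the tools parameter on each request,
# instead of dozens of definitions on every call.
```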

Real work sees strong gains

The 60.7% Finance Agent score and 1606 GDPVal Elo suggest this model excels at the kind of long-horizon, multi-step professional tasks that matter for enterprise deployments—from financial modeling to document generation.

Why this matters for your agents

Opus 4.6 is optimized for the most powerful agents. It’s better at the core tasks the best agents actually perform: using computers, running terminals, searching the web, and reasoning across long, multi-step workflows.

If you’re running agents for research, financial analysis, or knowledge work, Opus 4.6 is worth testing now. For large-scale tool orchestration, MCP Atlas is a known trade-off, but for many setups the gains elsewhere will outweigh it.

{{general-cta}}


ABOUT THE AUTHOR
Nicolas Zeeb
Technical Content Lead

Nick is Vellum’s technical content lead, writing about practical ways to use both voice and text-based agents at work. He has hands-on experience automating repetitive workflows so teams can focus on higher-value work.

ABOUT THE REVIEWER
David Vargas
Full Stack Founding Engineer

A Full-Stack Founding Engineer at Vellum, David Vargas is an MIT graduate (2017) with experience at a Series C startup and as an independent open-source engineer. He built tools for thought through his company, SamePage, and now focuses on shaping the next era of AI-driven tools for thought at Vellum.

Last updated: Feb 6, 2026