Evaluation: Claude 4 Sonnet vs OpenAI o4-mini vs Gemini 2.5 Pro
8 min

Analyzing the differences in performance, cost, and speed between the world's best reasoning models.

Author
Anita Kirkovska
May 23, 2025
Model Comparisons

Yesterday, Anthropic launched Claude 4 Opus and Sonnet. These models tighten things up, especially for teams building agents or working with large codebases. They're more reliable on long-running tasks, retain context better, and now support parallel tool use and local memory for deeper task awareness.

Anthropic gave early access to Opus 4 to a few companies, and they all reported strong results:

  • Rakuten ran a 7-hour open-source refactor with sustained performance.
  • Replit reported improved precision for complex changes across multiple files.

On the well-known benchmarks, the model leads SWE-bench (agentic coding) with 72.5%, a slight bump over OpenAI's o3 (69.1%). It gets even better results when run with parallel test-time compute (explained further in the article).

Comparison of top models on the top 4 benchmarks: SWE Bench, AIME’25, GPQA Diamond and MMMU.

In this report, we break down pricing, latency, standard benchmarks, and our own independent evals on adaptive reasoning and the hardest SAT math questions.

One thing surprised us… keep reading!

Results

In this analysis we compared Claude 4 Sonnet, OpenAI's o3, and Gemini 2.5 Pro on how well they adapt to new context and how well they solve the hardest SAT math problems. Here are the results!

Adaptive reasoning

This evaluation shows how well models reason when presented with new context for puzzles they've been heavily trained on. In our examples, we modified otherwise very popular puzzles, making them trivial and/or removing their constraints. We wanted to see whether the models would recognize the new context and solve the puzzles without overfitting to their training data.

  • Claude 4 Sonnet and Gemini 2.5 Pro performed best, with o3 and Claude 4 Opus close behind. These models show real improvement, using prompt context more than training data.
  • o4-mini still struggles with tricky, adversarial questions.

Hardest SAT questions (Math)

In this evaluation, we test how well these models solve the 50 hardest SAT math questions. And the new Claude models really took us by surprise!

  • Claude 4 Sonnet scored the highest, which is surprising since Anthropic has focused more on coding than math.
  • Claude 4 Opus, o4-mini, and o3 are close behind, all performing well.
  • OpenAI’s o3-mini and Gemini 2.5 Pro had the weakest results, with hit-or-miss accuracy around 50%.
  • We threw several other models into the mix to see how they compare. Interestingly enough, the newest Qwen model is holding its own!

Cost & Speed

Claude 4 Sonnet looks like the best choice across tasks, with the best balance of speed (1.9s latency) and cost ($3/$15 per 1M input/output tokens). For simpler tasks, any of the lighter models will do.

Methodology

In the next sections, we cover three analyses:

  • Speed & Cost comparison
  • Standard benchmark comparison (example: what is the reported performance for math tasks between Claude 4 Sonnet vs o3 vs Gemini 2.5 Pro?)
  • Independent evaluation experiments:
    • Adaptive reasoning
    • Hardest SAT math equations

Evaluations with Vellum

To conduct these evaluations, we used Vellum’s AI development platform, where we:

  • Configured all 0-shot prompt variations for the models using the LLM Playground.
  • Built the evaluation dataset and configured our evaluation experiment using the Evaluation Suite in Vellum. We used an LLM-as-a-judge to compare generated answers against the correct responses from our benchmark dataset for the math/reasoning problems (a rough sketch of this judging loop follows after this list).
  • A human reviewer evaluated all answers, then compiled and presented the findings.
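For readers who want to see roughly what that judging loop looks like, here is a minimal, illustrative Python sketch of the LLM-as-a-judge pattern. It is not Vellum's actual API; `call_model`, the judge prompt wording, and the dataset shape are hypothetical placeholders.

```python
# Minimal sketch of an LLM-as-a-judge evaluation loop (illustrative only).
# `call_model` is a hypothetical placeholder for whatever client/SDK you use
# to reach the model under test and the judge model.

from dataclasses import dataclass

@dataclass
class Example:
    question: str          # the math/reasoning problem
    reference_answer: str  # the known-correct answer from the benchmark dataset

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model_name` and return its text response."""
    raise NotImplementedError

JUDGE_PROMPT = (
    "You are grading a model's answer against a reference answer.\n"
    "Question: {question}\nReference answer: {reference}\nModel answer: {candidate}\n"
    "Reply with exactly CORRECT or INCORRECT."
)

def evaluate(model_name: str, judge_name: str, dataset: list[Example]) -> float:
    correct = 0
    for ex in dataset:
        candidate = call_model(model_name, ex.question)  # 0-shot: just the question
        verdict = call_model(judge_name, JUDGE_PROMPT.format(
            question=ex.question,
            reference=ex.reference_answer,
            candidate=candidate,
        ))
        if verdict.strip().upper().startswith("CORRECT"):
            correct += 1
    return correct / len(dataset)  # accuracy on the eval set
```

In practice the judge's verdicts are then spot-checked by a human reviewer, which is what we did here.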

You can skip to the section that interests you most using the "Table of Contents" panel on the left or scroll down to explore the full comparison between Claude 4 Sonnet, OpenAI’s o3 and Gemini 2.5 Pro.

Cost, Latency and Speed

Claude 4 Opus and OpenAI o3 just don’t justify their price tags. At $15/$75 and $10/$40 per 1M input/output tokens respectively, they’re massively more expensive, but not meaningfully better.

Claude 4 Sonnet performs nearly as well across math, coding, and reasoning at a fraction of the cost. Gemini 2.5 Pro and o4-mini are great models for simpler tasks too, and are much cheaper.
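To make the price gap concrete, here is a back-of-the-envelope calculation using the per-1M-token prices referenced in this article; the workload (tokens per request, request volume) is a made-up example, not a real measurement.

```python
# Back-of-the-envelope monthly cost using the list prices quoted in this article.
# The workload numbers below are assumptions for illustration only.

PRICES = {                       # (input $, output $) per 1M tokens
    "claude-4-opus": (15.00, 75.00),
    "openai-o3": (10.00, 40.00),
    "claude-4-sonnet": (3.00, 15.00),
}

def monthly_cost(model: str, in_tokens: int, out_tokens: int, requests: int) -> float:
    p_in, p_out = PRICES[model]
    return requests * (in_tokens / 1e6 * p_in + out_tokens / 1e6 * p_out)

# Example: 100k requests/month, ~2k input and ~500 output tokens each.
for m in PRICES:
    print(m, f"${monthly_cost(m, 2_000, 500, 100_000):,.0f}/month")
# claude-4-opus   -> ~$6,750/month
# openai-o3       -> ~$4,000/month
# claude-4-sonnet -> ~$1,350/month
```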

The same argument holds when it comes to speed.

Claude 4 Sonnet is already a fast model: its 1.9s latency is on par with the most advanced reasoning models (o3, o4-mini, Gemini 2.5 Pro), which makes it the best choice in terms of cost and speed for more advanced tasks.

For simpler tasks, even lighter and cheaper models like Gemini 2.5 Pro or o4-mini can do the job pretty well.

Standard Benchmarks

Looking at the benchmarks, it's clear that the Claude models still take the lead in coding, especially when run with parallel test-time compute. Opus 4 and Sonnet 4 are already strong, but they get even better (a 6–8% boost) when allowed multiple attempts in parallel (check the graph below, and the dotted lines).

So, how to interpret these numbers with Claude models?

  • If your use case allows reruns or multiple attempts (like retries, reranking, or sampling), these models will perform closer to the higher number (a rough sketch of this pattern follows after this list).
  • If your setup only allows one shot (e.g. latency-critical tasks), you should care more about the lower number.
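The "multiple attempts" setting roughly corresponds to a best-of-n pattern: sample several completions in parallel and keep the one your verifier or judge scores highest. The sketch below is illustrative only, not Anthropic's exact parallel test-time-compute harness; `call_model` and `score_candidate` are hypothetical placeholders.

```python
# Illustrative best-of-n sketch of the "multiple attempts in parallel" idea.
# This is NOT Anthropic's exact setup; plug in your own client and ranker.

from concurrent.futures import ThreadPoolExecutor

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder: one sampled attempt from the model."""
    raise NotImplementedError

def score_candidate(prompt: str, candidate: str) -> float:
    """Placeholder: rank attempts (e.g. run tests, a verifier, or a judge model)."""
    raise NotImplementedError

def best_of_n(model_name: str, prompt: str, n: int = 8) -> str:
    # Fire n independent attempts in parallel, then keep the highest-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: call_model(model_name, prompt), range(n)))
    return max(candidates, key=lambda c: score_candidate(prompt, c))
```

The trade-off is straightforward: n attempts cost roughly n times as much and only help if your scorer can reliably pick the best one.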

Also, Anthropic didn’t highlight any math improvements in the official announcement, but the results clearly show big gains there.

Comparison of top models on the top 4 benchmarks: SWE Bench, AIME’25, GPQA Diamond and MMMU.

Evaluation 1: Adaptive Reasoning

We tested the models on 28 well-known puzzles. For this evaluation, we changed some portion of each puzzle, making it trivial. We wanted to see if the models still overfit on training data or could successfully adapt to new context. For example, we modified the Monty Hall problem so that the host does not open an additional door:

👉🏼“Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?”

In the original Monty Hall problem, the host reveals an extra door. In this version he does not, and since no additional information is provided, your odds stay the same. The correct answer here is: “It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.”
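A quick Monte Carlo simulation backs this up: with no door revealed, switching and staying each win about a third of the time. This is just an illustrative sanity check of the expected answer.

```python
# Monte Carlo sanity check of the modified Monty Hall setup above:
# the host reveals nothing, so switching and staying should win equally often (~1/3 each).

import random

def trial(switch: bool) -> bool:
    prize = random.randrange(3)   # door hiding the gold bar
    pick = random.randrange(3)    # contestant's initial pick
    if switch:
        # Swap to another door without any door having been opened.
        pick = random.choice([d for d in range(3) if d != pick])
    return pick == prize

N = 100_000
print("stay:  ", sum(trial(False) for _ in range(N)) / N)   # ≈ 0.333
print("switch:", sum(trial(True) for _ in range(N)) / N)    # ≈ 0.333
```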

Most reasoning models struggle with these kinds of changes to very famous puzzles, but newer models are starting to do better here. This is true for the new Claude 4 models:

[Interactive chart: adaptive reasoning results across models]

From the above we can clearly see that:

  • Gemini 2.5 Pro and Claude 4 Sonnet have the best results, and OpenAI’s o3 and Claude 4 Opus did really well on this task too. These are the first models where we see significant improvement on this task: analyzing the actual answers, we noticed less reliance on training data and more reliance on the new context provided in the prompt.
  • The o4-mini model still struggles with these kinds of adversarial questions.

Claude 4 Sonnet in particular did great on these questions, getting the answer right 21 out of 28 times. Looking at the “thinking summaries,” we noticed the model followed a thoughtful chain of thought using the new context. In most cases, the model was aware that there was new context in the prompt.

For example, in our adjusted classic river-crossing puzzle there are no hard constraints:

👉🏼A farmer wants to cross a river and take with him a wolf, a goat, and a cabbage. He has a boat with three secure separate compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. What is the minimum number of crossings the farmer needs to make to bring the wolf, the goat, and the cabbage across the river without anything being eaten?

and the model quickly analyzed all “constraints” and answered:

“This is different from the classic river-crossing puzzle where the boat can only hold the farmer plus one item, requiring multiple trips and strategic planning. My final answer is 1 trip.”

Evaluation 2: Hardest SAT problems

For this task, we compared the models on how well they solve some of the hardest SAT math questions. This is the 0-shot prompt we used for all models:

You are a helpful assistant who is the best at solving math equations. You must output only the answer, without explanations. Here’s the <question>
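Because the prompt forces answer-only output, a normalized exact-match check makes a cheap first pass before the LLM-as-a-judge and human review described in the methodology. The sketch below is illustrative only; `call_model` is again a hypothetical placeholder for your own client.

```python
# Minimal sketch of scoring the 0-shot SAT runs: the prompt asks for the answer only,
# so a normalized match catches most cases before judge/human review (illustrative).

SYSTEM_PROMPT = ("You are a helpful assistant who is the best at solving math equations. "
                 "You must output only the answer, without explanations.")

def call_model(model_name: str, system: str, user: str) -> str:
    """Placeholder: send a system + user prompt to the model and return its text."""
    raise NotImplementedError

def normalize(ans: str) -> str:
    # Cheap normalization: trim whitespace, lowercase, drop spaces and trailing periods.
    return ans.strip().lower().replace(" ", "").rstrip(".")

def score(model_name: str, questions: list[tuple[str, str]]) -> float:
    """questions: list of (question_text, expected_answer) pairs; returns accuracy."""
    hits = sum(
        normalize(call_model(model_name, SYSTEM_PROMPT, q)) == normalize(a)
        for q, a in questions
    )
    return hits / len(questions)
```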

We then ran all 50 math questions and here’s what we got:

[Interactive chart: results on the 50 hardest SAT math questions]

From the above we can clearly see that:

  • Claude 4 Sonnet got the best score of all the proprietary and open-source models here, which is an interesting surprise, because we know Anthropic has been focusing on improving its models for coding tasks, not math.
  • o4-mini, o3, and Claude 4 Opus have similar results and trail close behind Claude 4 Sonnet. All of them performed quite well.
  • The weakest, however, are OpenAI’s o3-mini and Gemini 2.5 Pro, which are still hit or miss, landing around 50/50 on these examples.

Conclusion

Claude 4 is here and while it is not a game changer, it is a solid step forward. The real standout is Claude 4 Sonnet. It is fast, smart, and much more affordable than the bigger models like Opus 4 or OpenAI o3, which honestly do not give you much more for the extra cost.

If you are building anything that needs solid reasoning, math skills, or long context understanding, Sonnet is probably your best bet. And for simpler tasks, o4-mini or Gemini 2.5 Pro get the job done just fine.

Bottom line: Claude 4 Sonnet is the best choice on the market now!

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.
