Analysis: OpenAI o1 vs GPT-4o
Sep 12, 2024
Co-authors:
Anita Kirkovska
Founding GenAI Growth
Akash Sharma
Co-founder & CEO

Today, OpenAI announced that they’ve reached a new level of AI capability, and they’re resetting the counter back to 1. They shipped their latest models: OpenAI o1 and OpenAI o1 mini.

These models are built to handle hard problems: they take more time to think before responding, similar to how a person would approach a difficult task. The OpenAI o1 model, specifically, shows incredible results on various hard problems in math, coding, and reasoning. The model is also much better at resisting “jailbreaks”, almost 4 times better than GPT-4o.

This next level of reasoning will have an impact on many industries, from genomics and economics to cognition and quantum physics. It’s that powerful.

The smaller version, OpenAI o1 mini, is designed specifically for developers. It excels at accurately generating and debugging complex code, and it’s 80% cheaper than OpenAI o1.

Now that we have the TL;DR out of the way, let’s get into our analysis, where we compare these models on three tasks and cover some of the latest benchmarks and human-expert reviews.

Results

To measure the impact of this new AI capability, we analyzed and compared the OpenAI o1 and GPT-4o models on these tasks:

  • Reasoning riddles
  • Math equations
  • Customer ticket classification

For these specific tasks, we've learned that:

  • Reasoning riddles: Compared to GPT-4o, OpenAI o1 got only one more example correct (12/16). The results between the two models were similar, but we noticed that both GPT-4o and OpenAI o1 struggled with analogy-based riddles, while o1 performed better on riddles that required more calculation. We also threw the mini models into the mix: GPT-4o mini (63%) actually outperformed OpenAI o1 mini (54%), and answered in a fraction of the time.
  • Math equations: We used 10 of the hardest SAT math questions for this experiment, and OpenAI o1 got 6/10 right, which is really impressive! GPT-4o did poorly on this task and got only 2 equations right. For good measure, we added Claude 3.5 Sonnet to the mix, but it performed just as poorly as GPT-4o.
  • Classification: OpenAI o1 improved accuracy by 13 percentage points over GPT-4o (75% vs. 62%) across 100 test cases, which is a great improvement for this task. It also has the best precision (83%), recall, and overall F1 score, so if your classification task is not sensitive to latency, you should go with OpenAI o1.

We can conclude that:

  • If you want an extremely fast model at a lower cost, go with GPT-4o mini.
  • If you want the most capable model for production — go with GPT-4o.
  • If you want to solve some extremely hard problems (especially in math!) and you don’t care about latency — go with OpenAI o1.

💡If you're looking to evaluate these models on your own task - Vellum can help. Book a call with one of our AI experts to set up your evaluation.

Important Observations

1) Productionizing any feature built on top of the OpenAI o1 model is going to be very hard

The thinking step can take a long time (I waited more than 3 minutes for some answers!), and there’s no reliable way to predict how long it will take. OpenAI hides the actual chain-of-thought process and only provides a summary of it, so there is no good way for us to estimate how long a given output will take to generate, or to understand how the model thinks. In some cases I ran the same question three times with OpenAI o1 and got three different answers. Also, while the reasoning is not visible in the API, those tokens still occupy space in the model's context window and are billed as output tokens, so expect to pay top-tier prices for tokens you never see.
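If you want to see how much of your bill goes to this hidden reasoning, you can inspect the usage object the API returns. Here’s a minimal sketch using the OpenAI Python SDK; the completion_tokens_details.reasoning_tokens field follows the response shape OpenAI documented at launch, so treat the exact field names as an assumption if your SDK version differs.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "How many prime numbers are there between 100 and 150?"}],
)

usage = response.usage
# completion_tokens includes the hidden reasoning tokens, and you pay for all of them.
print("visible answer:", response.choices[0].message.content)
print("completion tokens billed:", usage.completion_tokens)

details = getattr(usage, "completion_tokens_details", None)
if details is not None:
    print("of which hidden reasoning tokens:", details.reasoning_tokens)
```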

2) OpenAI o1 won't need advanced prompting

It seems like you can prompt these models in a very straightforward way. Adding chain-of-thought instructions or few-shot examples won't improve results, and in some cases it might even hinder performance.
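For context, this is the kind of direct prompt we mean, sketched with the OpenAI Python SDK (the riddle is a generic illustration, not one from our evaluation set): no "think step by step" instruction, no worked examples, just the question.

```python
from openai import OpenAI

client = OpenAI()

# A plain, direct prompt: no chain-of-thought scaffolding, no few-shot examples.
prompt = (
    "A farmer has 17 sheep. All but 9 run away. "
    "How many sheep does the farmer have left?"
)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected answer: 9
```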

3) The OpenAI o1 model won’t be useful for many common use-cases

While the model is really powerful for solving hard problems, it’s still not equipped with the standard features and parameters that GPT-4o has. Streaming, tool use, and other features are disabled in the API, so keep that in mind when choosing a model for your use-case. Also, the human-expert reviews showed a preference for GPT-4o on some natural language tasks, which means this model is not the best choice for every task.

4) Choose the problem and your models wisely

Now, more than ever, we need to know which tasks are better solved with “reasoning models” versus “standard models”. For a basic reasoning task, GPT-4o took less than a second to provide the answer, while we waited 2-3 minutes for OpenAI o1 to “think” (more like overthink!) its way to the same answer. On the other hand, GPT-4o is also fast to make mistakes. Balance will be key.

Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!

Why is the OpenAI o1 Model so Much Better?

To put it simply, the new o1 model is so much better because of two changes:

  • It’s trained with a large-scale reinforcement learning algorithm that teaches the model how to answer queries using chain of thought (read more about CoT here);
  • It also takes extra time to think during inference, improving its answers in real time.
Visual: o1 performance improves with both train-time and test-time compute.

We covered the Orion and the Strawberry models in this post, but if you want to go deep into the technical details read their system card here.

But now, let’s go into the analysis.

Our Approach

The main focus of this analysis is to compare GPT-4o (gpt-4o-2024-08-06) and the OpenAI o1 model.

We look at standard benchmarks and human-expert reviews, and conduct a set of our own small-scale experiments.

In the sections that follow, we cover:

  • Latency and Cost comparison
  • Standard benchmark comparison (for example: what is the reported math performance of GPT-4o vs. OpenAI o1?)
  • Human-expert reviews (OpenAI’s own version of the Chatbot Arena)
  • Three evaluation experiments (math equations, classification and reasoning)

You can skip to the section that interests you most, or scroll down to explore the full comparison between OpenAI o1 and GPT-4o.

Latency Comparison

As expected, the new o1 models are much slower, due to their “reasoning” process.

OpenAI o1 is approximately 30 times slower than GPT-4o. Similarly, the o1 mini version is around 16 times slower than GPT-4o mini.
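If you want to sanity-check these numbers on your own prompts, a rough sketch like the one below is enough. Note that o1 doesn’t support streaming, so end-to-end wall-clock time is what your users will actually feel; single runs are noisy, so average over several calls. The model names here are the API identifiers we assume were current at the time of writing.

```python
import time
from openai import OpenAI

client = OpenAI()

def time_completion(model: str, prompt: str) -> float:
    """Return wall-clock seconds for a single non-streaming completion."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

question = "A train travels 90 miles in 75 minutes. What is its average speed in mph?"
for model in ("gpt-4o-2024-08-06", "o1-preview"):
    print(f"{model}: {time_completion(model, question):.1f}s")
```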

Cost Comparison

When it comes to cost, OpenAI o1 and o1 mini are among the most expensive models on the market right now, for two reasons: 1) their input/output tokens cost a lot, and 2) you also get charged for the hidden CoT tokens that you don't see in the output.

OpenAI o1 costs $15.00 per 1M input tokens and $60.00 per 1M output tokens, making it 3 times more expensive than GPT-4o. Similarly, OpenAI o1 mini costs $3.00 per 1M input tokens and $12.00 per 1M output tokens, making it 20 times more expensive than GPT-4o mini.

So, unless you’re dealing with very hard problems where the extra performance of o1 or o1 mini is necessary, it’s hard to justify the significant cost difference, especially when GPT-4o provides similar capabilities at a fraction of the price.
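To make the trade-off concrete, here’s a small back-of-the-envelope calculator using the prices above. The GPT-4o and GPT-4o mini prices are our assumptions based on OpenAI’s published pricing at the time of writing, and the hidden reasoning-token count in the example is purely hypothetical, so plug in your own numbers.

```python
# Price per 1M tokens in USD. The o1 prices are from the comparison above;
# the GPT-4o / GPT-4o mini figures are assumptions (check OpenAI's pricing page).
PRICES = {
    "gpt-4o":      {"input": 5.00,  "output": 15.00},
    "gpt-4o-mini": {"input": 0.15,  "output": 0.60},
    "o1":          {"input": 15.00, "output": 60.00},
    "o1-mini":     {"input": 3.00,  "output": 12.00},
}

def request_cost(model: str, input_tokens: int, visible_output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Estimate the USD cost of one request; hidden reasoning tokens bill as output."""
    p = PRICES[model]
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * p["input"] + billed_output * p["output"]) / 1_000_000

# Example: 1,000 input tokens, a 300-token visible answer, and (for o1)
# a hypothetical 2,000 hidden reasoning tokens.
print(f"GPT-4o: ${request_cost('gpt-4o', 1_000, 300):.4f}")     # $0.0095
print(f"o1:     ${request_cost('o1', 1_000, 300, 2_000):.4f}")  # $0.1530
```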

Reported Capabilities

When new models are released, we learn about their capabilities from benchmark data reported in the technical reports. The new OpenAI o1 model improves on the most complex reasoning benchmarks:

  • Exceeds human PhD-level accuracy on challenging benchmark tasks in physics, chemistry, and biology on the GPQA benchmark
  • Coding gets easier: it ranks in the 89th percentile on competitive programming questions (Codeforces)
  • It’s also very good at math: in a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Now, this is next level.

On the standard ML benchmarks, it shows huge improvements across the board.

More statistics from Chatbot Arena (ELO Leaderboard)

This public ELO leaderboard is part of the LMSYS Chatbot Arena. The chatbot arena allows you to prompt two anonymous language models, vote on the best response, and then reveal their identities.

They’ve gathered over 6,000 votes, and the results show that the OpenAI o1 model is consistently ranked #1 across all categories, with Math being the most notable area of impact. The o1-mini model is #1 in technical areas, #2 overall. Check out the full results on this link.

The o1-preview and o1-mini models outrank all other models on Math tasks

Human Expert Reviews

OpenAI also brought in human experts to review and compare the new model with GPT-4o, without knowing which model they were evaluating.

The results show that the newest model is great at complex tasks, but not preferred for some natural language tasks — suggesting that maybe the model is not the best for every use-case.

Task 1: Math Equations

For this task, we’ll compare the GPT-4o and OpenAI o1 models on how well they solve some of the hardest SAT math questions. This is the basic prompt that we used for both models:

You are a helpful assistant who is the best at solving math equations. You must output only the answer, without explanations. When given a list of answers, output the letter for example "C)" followed by the answer. Only output words and numbers, no commas or semicolons. Always start with a capitalized letter.

To write this prompt, we used Vellum’s Prompt IDE to test for answer structure and validity. We didn’t want the model to output extra characters or explanations before we ran the full evaluation set with all 10 SAT math questions, so we took some time refining the prompt:

Preview of the Prompt Engineering IDE with two models and one scenario (math equation) on the left.

Then we ran all test cases (10 math equations) in Vellum Evaluations:
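If you want to reproduce a similar check outside of Vellum, a small script like the sketch below works. The evaluation set here is a hypothetical placeholder, and since o1-preview didn’t accept a system role at launch, we fold the instructions into the user message for both models.

```python
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = (
    "You are a helpful assistant who is the best at solving math equations. "
    "You must output only the answer, without explanations. When given a list of "
    'answers, output the letter for example "C)" followed by the answer. Only output '
    "words and numbers, no commas or semicolons. Always start with a capitalized letter."
)

# Hypothetical placeholder; the real evaluation used 10 hard SAT math questions.
sat_questions = [
    ("If 2x + 3 = 11, what is the value of x? A) 2 B) 3 C) 4 D) 5", "C) 4"),
]

def evaluate(model: str) -> int:
    correct = 0
    for question, expected in sat_questions:
        response = client.chat.completions.create(
            model=model,
            # o1-preview did not support system messages at launch,
            # so the instructions go into the user message for both models.
            messages=[{"role": "user", "content": f"{INSTRUCTIONS}\n\n{question}"}],
        )
        answer = response.choices[0].message.content.strip()
        correct += int(answer == expected)
    return correct

for model in ("gpt-4o-2024-08-06", "o1-preview"):
    print(model, evaluate(model), "/", len(sat_questions))
```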

Here’s what we found out

  • GPT-4o did poorly on most of the examples, and only got 2 right. OpenAI o1 got 6 of them right — which is impressive.
  • We put Claude 3.5 Sonnet in the mix for good measure, and it performed just as poorly as GPT-4o, with identical results.

Winner: OpenAI o1!

Task 2: Reasoning

GPT-4o has been the best model for reasoning tasks, as we can see from standard benchmarks and independently run evaluations.

Will OpenAI o1 take the #1 position?

To find out, we selected 16 verbal reasoning questions to compare the two. Here is an example riddle and its sources:


Then we ran the evaluation across all cases:

From the image above we can see that:

  • OpenAI o1 got only one more example correct than GPT-4o. While not a big difference, we can see some improvement.
  • The examples it improved on involved mathematical and distance calculations, which is somewhat expected given the o1 model’s increased math capabilities.
  • We were also curious to see how the “mini” models would perform: GPT-4o mini (63%) outperformed OpenAI o1 mini (54%), and did so in less time.

Winner: For the 16-question set we tested, it’s essentially a tie.

Task 3: Classification

In this analysis, we had both OpenAI o1 and GPT-4o determine whether a customer support ticket was resolved or not. In our prompt we provided clear instructions on when a customer ticket counts as resolved, and added few-shot examples to help with the most difficult cases.
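As a rough illustration of that setup, here’s a minimal sketch of a resolution-classification prompt with few-shot examples; the wording, labels, and example tickets are hypothetical stand-ins, not the exact prompt we ran.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative prompt: clear resolution criteria plus two few-shot examples.
CLASSIFICATION_PROMPT = """You are classifying customer support tickets.
A ticket is RESOLVED only if the customer's issue was fully addressed and the
customer confirmed the fix (or stopped responding after a working solution).
Otherwise it is UNRESOLVED. Answer with exactly one word.

Example 1:
Ticket: "Thanks, the new API key works now!"
Label: RESOLVED

Example 2:
Ticket: "I tried the steps you sent but I'm still getting a 500 error."
Label: UNRESOLVED

Ticket: "{ticket_text}"
Label:"""

def classify(model: str, ticket_text: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(ticket_text=ticket_text)}],
    )
    return response.choices[0].message.content.strip().upper()

print(classify("o1-preview", "The billing page finally loads correctly, thank you!"))
```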

We ran the evaluation to test if the models' outputs matched our ground truth data for 100 labeled test cases.

You can see the results we got here:

Here’s what we observed:

  • OpenAI o1 did significantly better here, improving accuracy from 62% to 75% across the 100 test cases, which is a meaningful jump for this task.

For classification tasks, accuracy is important but not the only metric to consider, especially in contexts where false positives (incorrectly marking unresolved tickets as resolved) can lead to customer dissatisfaction.

So, we calculated the precision, recall, and F1 score for these models:

Metric                  GPT-4o     OpenAI o1
Accuracy                62%        75%
False positives (FP)    6          8
True positives (TP)     24         39
False negatives (FN)    32         17
Precision               80%        82.98%
Recall                  42.86%     69.64%
F1 score                55.81%     75.73%
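For transparency, here’s how those percentages fall out of the raw confusion-matrix counts; the true-negative counts aren’t shown in the table, so we infer them from the 100-case total (100 - TP - FP - FN).

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

gpt_4o = classification_metrics(tp=24, fp=6, fn=32, tn=38)   # TN inferred: 100 - 24 - 6 - 32
o1     = classification_metrics(tp=39, fp=8, fn=17, tn=36)   # TN inferred: 100 - 39 - 8 - 17

for name, metrics in [("GPT-4o", gpt_4o), ("OpenAI o1", o1)]:
    print(name, {k: f"{v:.2%}" for k, v in metrics.items()})
# GPT-4o    -> accuracy 62.00%, precision 80.00%, recall 42.86%, F1 55.81%
# OpenAI o1 -> accuracy 75.00%, precision 82.98%, recall 69.64%, F1 75.73%
```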

Winner: For this task, OpenAI o1 clearly wins. It has higher accuracy, precision, recall, and F1 score. If your classification task is not sensitive to latency, consider using OpenAI o1.

Conclusion

Although OpenAI o1 shines in many areas, its high latency and closed system can be real barriers for production use. For most cases, GPT-4o is still the go-to model, while OpenAI o1 is better suited for solving tough problems behind the scenes, rather than being the first choice for everyday production needs.

To try Vellum and evaluate these models on your tasks, book a demo here.

ABOUT THE AUTHOR
Anita Kirkovska
Founding GenAI Growth

Anita Kirkovska is currently leading Growth and Content Marketing at Vellum. She is a technical marketer with an engineering background and a sharp acumen for scaling startups. She has helped SaaS startups scale and had a successful exit from an ML company. Anita writes a lot of content on generative AI to educate business founders on best practices in the field.

ABOUT THE AUTHOR
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop, and evaluate LLM-powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape and is actively sharing his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley office.
