
GPT-4o Mini vs. Claude 3 Haiku vs. GPT-3.5 Turbo: A Comparison

A comparison of the latest low-cost, low-latency models


New day, new model! Today, OpenAI released “GPT-4o Mini”, their latest cost-efficient small model. At 128k tokens, its context window is 8x larger than GPT-3.5 Turbo's. OpenAI suggests developers use GPT-4o Mini wherever they previously used GPT-3.5 Turbo, as the new model is multimodal, performs better on benchmarks, and is more than 60% cheaper.

We see a pattern emerging of model providers announcing models across “weight classes”: Anthropic has three different Claude 3 models and OpenAI has two GPT-4o models. In this comparison article, we’ll answer the following questions:

  • Does GPT-4o Mini really perform better than GPT-3.5 Turbo for my tasks?
  • What’s the best small model currently on the market? Claude 3 Haiku or an OpenAI model?

To compare GPT-4o Mini, GPT-3.5 Turbo and Claude 3 Haiku, we evaluated them across three different tasks:

  • Data extraction from legal contracts
  • Customer tickets classification
  • Verbal reasoning

For these specific tasks, we learned that:

  • Data Extraction: GPT-4o Mini performs worse than GPT-3.5 Turbo and Claude 3 Haiku, sometimes missing the mark entirely. None of the models is accurate enough for this task (only 60-70% accuracy).
  • Classification: GPT-4o Mini has the highest precision (88.89%), making it the best choice for avoiding false positives. GPT-4o Mini and GPT-3.5 Turbo have comparable F1 scores.
  • Verbal Reasoning: GPT-4o Mini outperforms the other models. It doesn’t do well on numerical questions but performs well on relationship- and language-specific ones.

Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!

Our Approach

We look at standard benchmarks and community-run data, and conduct a set of our own small-scale experiments.

In the next few sections we cover:

  • Cost comparison
  • Performance comparison (latency, throughput)
  • Standard benchmark comparison (example: what is the reported performance on math tasks for GPT-4o Mini vs. GPT-3.5 Turbo vs. Claude 3 Haiku?)
  • Three evaluation experiments (data extraction, classification and verbal reasoning)

You can skip to the section that interests you most using the "Table of Contents" panel on the left or scroll down to explore the full comparison between the models.

Cost Comparison

OpenAI has remained true to its word of continuously pushing costs down and making AI accessible to a large number of people.
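The launch pricing tells the story: GPT-4o Mini lists at $0.15 per 1M input tokens and $0.60 per 1M output tokens, versus $0.50/$1.50 for GPT-3.5 Turbo and $0.25/$1.25 for Claude 3 Haiku. Here's a quick back-of-the-envelope sketch of what that means per request; the 3,000-input / 500-output token workload is an illustrative assumption, and prices change, so verify against the providers' current pricing pages.

# Rough per-request cost comparison, assuming a typical workload of
# 3,000 input tokens and 500 output tokens per request.
# Prices are USD per 1M tokens (launch-time list prices; check current pricing pages).
PRICES = {
    "gpt-4o-mini":    {"input": 0.15, "output": 0.60},
    "gpt-3.5-turbo":  {"input": 0.50, "output": 1.50},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES:
    print(f"{model}: ${request_cost(model, 3_000, 500):.5f} per request")

On this workload GPT-4o Mini works out to roughly a third of GPT-3.5 Turbo's cost, consistent with OpenAI's "more than 60% cheaper" claim.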

Winner = GPT-4o Mini

Performance Comparison

Latency Comparison

Latency, or time to first token, is an important metric to minimize because it reduces how “slow” the model feels when it responds. The data below measures p50 latency across a dataset, and you can see that GPT-3.5 Turbo is marginally faster than the other models.

However, latency is affected by the load on the API and the size of the prompt. Given the overall low latency of these models and the fact that the data here only shows the median, we’d call this a tie between the three models.

Result = TIE

Note: lower is better

Throughput Comparison

Throughput, on the other hand, is the number of tokens a model can generate per second once it has produced the first token. GPT-4o Mini’s throughput is significantly higher than the other models on the market, so for long-form output generation (e.g., writing a job description) GPT-4o Mini’s completions will likely finish fastest despite a slower time to first token.

Winner = GPT-4o Mini

Note: higher is better
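If you want to sanity-check these numbers on your own prompts, time to first token and tokens per second can be measured directly from a streamed completion. Below is a minimal sketch using the OpenAI Python SDK; it assumes each streamed chunk carries roughly one token, which is an approximation, and the same idea applies to Anthropic's streaming API for Claude 3 Haiku.

import time
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) and OPENAI_API_KEY in the environment

client = OpenAI()

def measure(model: str, prompt: str) -> dict:
    # Stream a completion and record time to first token plus rough tokens/second.
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # approximation: ~1 token per streamed chunk
    total = time.perf_counter() - start
    ttft = (first_token_at - start) if first_token_at else None
    throughput = chunks / (total - ttft) if ttft and total > ttft else None
    return {"ttft_seconds": ttft, "tokens_per_second": throughput}

print(measure("gpt-4o-mini", "Write a short job description for a data analyst."))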

Reported Capabilities

Standard benchmarks

As part of GPT-4o Mini’s launch blog, OpenAI released details about the model’s performance on standard benchmarks:

  • MMLU (Massive Multitask Language Understanding)
  • GPQA (Graduate Level Google-Proof Q&A)
  • DROP (Discrete Reasoning Over Paragraphs)
  • MGSM (Multilingual Grade School Math)
  • MATH (General Mathematics)
  • HumanEval (Code Generation)
  • MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning)
  • MathVista (Visual Math Reasoning)

Here are the main takeaways from this data:

  • GPT-4o Mini performs second best on most benchmarks, after GPT-4o. The biggest improvements compared to GPT-3.5 Turbo are in mathematics (MATH and MGSM), which was a common weakness of prior generations of LLMs.
  • GPT-4o Mini adds visual math reasoning capabilities thanks to multimodality; this capability was not present in GPT-3.5 Turbo.

ELO Leaderboard

For chat completions, as ranked on the public ELO leaderboard (LMSYS Chatbot Arena), GPT-4o outranks GPT-4 Turbo:

Benchmarks and crowdsourced evals matter, but they don’t tell the whole story. To really know how your AI system performs, you must dive deep and evaluate these models for your use-case.

Now, let’s compare these models on three tasks that might be useful for your project.

Task 1: Data Extraction

For this task, we’ll compare GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku on their ability to extract key pieces of information from legal contracts. Our dataset includes Master Services Agreements (MSAs) between companies and their customers. The contracts vary in length, with some as short as 5 pages and others longer than 50 pages.

In this evaluation we’ll extract a total of 12 fields, such as Contract Title, Name of Customer, Name of Vendor, details of the Termination Clause, and whether a Force Majeure clause was present.

Here is our original prompt, including the JSON schema we expected the model to return:

You're a contract reviewer who is working to help review contracts following a Merger & Acquisition deal. Your goal is to analyze the text provided and return key data points, focusing on contract terms, risk, and other characteristics that would be important. You should only use the text provided to return the data.

From the provided text, create valid JSON with the schema:
{
contract_title: string, // the name of the agreement
customer: string, // this is the customer signing the agreement
vendor: string, // this is the vendor who is supplying the services
effective_date: date, // format as m/d/yyyy
initial_term: string, // the length of the agreement (ex. 1 year, 5 years, 18 months, etc.)
extension_renewal_options: string, // are there extension or renewal options in the contract? 
automatic_renewal: string, // is this agreement set to automatically renew? 
termination_clause: string, // the full text in the contract containing information about how to terminate the agreement
termination_notice: string, // the number of days that must be given notice before the agreement can be terminated. only include the number. 
force_majeure: string, // is there a clause for force majeure present in the agreement? 
force_majeure_pandemic: string, // does force majeure include reference to viral outbreaks, pandemics or epidemic events? 
assignment_allowed: string, // is there language specifying whether assignment is allowed? answer in only one sentence.
jurisdiction: string, // the jurisdiction or governing law for the agreement (ex. Montana, Georgia, New York). if this is a state, only answer with the name of the state.
}

Contract:
"""
{{ contract }}
"""

We gathered ground truth data for 10 contracts and used Vellum Evaluations to create 14 custom metrics. These metrics compared our ground truth data with the LLM's output for each parameter in the JSON generated by the model.
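Conceptually, each of those metrics boils down to comparing the model's value for a field against the labeled value across the 10 contracts. A rough sketch of that scoring logic, outside of Vellum, might look like this (the normalization here is an illustrative assumption, not our exact metric definitions):

def field_accuracy(model_outputs: list[dict], ground_truth: list[dict], field: str) -> float:
    # Share of contracts where the model's value for `field` matches the labeled value.
    def norm(value) -> str:
        return str(value or "").strip().lower()
    matches = sum(
        norm(out.get(field)) == norm(gt.get(field))
        for out, gt in zip(model_outputs, ground_truth)
    )
    return matches / len(ground_truth)

# e.g. field_accuracy(gpt_4o_mini_outputs, labels, "jurisdiction")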

Then, we tested GPT-4o Mini, GPT-3.5 Turbo & Claude 3 Haiku using Vellum’s Evaluation suite:

We then compared how well each model extracted the correct parameters by looking at the absolute and relative mean values for each entity, using Evaluation Reports:

Here’s what we found across the 14 fields:

  • 7 fields had equal performance across models. For the other 7 fields, results were a mixed bag; in most cases one model did worse and the other two were tied at the top:
    • GPT-4o Mini was the worst-performing model on 4 fields.
    • GPT-3.5 Turbo was the worst on 2 fields.
    • Claude 3 Haiku was the worst on 1 field.
  • For one of the fields, GPT-4o Mini completely missed the mark, with 20% accuracy compared to 70% for GPT-3.5 Turbo & Claude 3 Haiku.
  • From an absolute perspective, this weight class of models doesn’t provide the desired quality for accurate data extraction. Most fields only reached 60-70% accuracy, and some were far lower. For a complex data extraction task where accuracy is important, pick a more powerful model like GPT-4o or Claude 3.5 Sonnet and use advanced prompting techniques like few-shot or chain-of-thought prompting.

Winner: Claude 3 Haiku beats GPT-3.5 Turbo marginally, but all models fall short of the mark for this data extraction task.

Task 2: Classification

In this evaluation, we had GPT-3.5 Turbo, Claude 3 Haiku and GPT-4o Mini determine whether a customer support ticket was resolved or not. In our prompt we provided clear instructions for when a customer ticket is considered closed, and added few-shot examples to help with the most difficult cases.

We ran the evaluation to test if the models' outputs matched our ground truth data for 100 labeled test cases.

In the Evaluation Report below you can see how all models compare to GPT-4o Mini:

We can see from the report that:

  1. Accuracy Comparison: GPT-4o Mini (0.72) does better than Claude 3 Haiku (0.61) and GPT-3.5 Turbo (0.66).
  2. Improvements: Claude 3 Haiku and GPT-3.5 Turbo outperform GPT-4o Mini on 11 completions
  3. Regressions: Claude 3 Haiku and GPT-3.5 Turbo underperform GPT-4o Mini on 22 and 17 completions respectively, adding further evidence that GPT-4o Mini does better at this classification task

Accuracy is important but not the only metric to consider, especially in contexts where false positives (incorrectly marking unresolved tickets as resolved) can lead to customer dissatisfaction.

So, we calculated the precision, recall and F1 score for these models:
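These numbers can be reproduced from the raw predictions in a few lines. The sketch below uses scikit-learn and assumes the labels are plain "resolved" / "unresolved" strings, which is an assumption about how the test cases are encoded rather than a description of our exact pipeline.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy example: ground-truth labels and one model's predictions (positive class = "resolved")
y_true = ["resolved", "unresolved", "resolved", "unresolved", "resolved"]
y_pred = ["resolved", "resolved",   "resolved", "unresolved", "unresolved"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="resolved"))
print("recall   :", recall_score(y_true, y_pred, pos_label="resolved"))
print("f1       :", f1_score(y_true, y_pred, pos_label="resolved"))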

Key takeaways:

  • GPT-4o Mini has the highest precision at 88.89%, indicating it is the best at avoiding false positives. This means when GPT-4o Mini classifies a ticket as resolved, it is more likely to be accurate, thus reducing the chance of incorrectly marking unresolved tickets as resolved.
  • Both GPT-4o Mini and GPT-3.5 Turbo have higher F1 scores compared to Claude 3 Haiku

Winner: TIE between GPT-4o Mini and GPT-3.5 Turbo; the choice depends on whether Type I errors (false positives) or Type II errors (false negatives) matter more for your use case.

Note: Keep in mind that prompting techniques can help improve these numbers. We can analyze the misclassified scenarios and use those insights to prompt the model better. AI development is all about iterative improvement.

Task 3: Reasoning

The benchmarks released by OpenAI say that GPT-4o Mini is the best model in its weight class on reasoning tasks. Let’s see how it does on our evals. We selected 16 verbal reasoning questions to compare the three models. Here is an example question:

💡  Verbal reasoning question:

1. Choose the word that best completes the analogy: Feather is to Bird as Scale is to _______.

A) Reptile
B) Dog
C) Fish
D) Plant

Answer: A) Reptile
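Scoring is a simple exact-match on the chosen option: we check whether the model's completion names the expected letter. A minimal grading sketch (the answer-extraction regex is an illustrative assumption) could look like this:

import re

def grade(completion: str, correct_letter: str) -> bool:
    # Pull the first standalone A-D option letter from the completion and compare it to the key.
    match = re.search(r"\b([A-D])\b", completion.upper())
    return bool(match) and match.group(1) == correct_letter.upper()

print(grade("The answer is A) Reptile.", "A"))          # True
print(grade("Scales are found on fish, so C.", "A"))    # False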

Below is a screenshot of the initial test we ran in our prompt environment in Vellum:

Now, let’s run the evaluation across all 16 reasoning questions:

From the image above we can see that:

  • GPT-4o Mini outperforms the other models with 50% accuracy, versus 44% for GPT-3.5 Turbo and 19% for Claude 3 Haiku.
  • Claude 3 Haiku is often unable to complete its output; better prompt engineering would likely resolve this issue.
  • GPT-4o Mini doesn’t do well on numerical questions but performs well on relationship- and language-specific ones.

Winner: GPT-4o Mini

Summary

Conclusion

While GPT-4o Mini leads in most areas, further evaluation and prompt testing on your specific use case is essential to fully understand the capabilities of these models. Building production-ready AI systems requires careful trade-offs, good prompt curation, and iterative evaluation.

Want to compare these models on your tasks & test cases? Vellum can help! Book a demo here.

Source for throughput & latency: artificialanalysis.ai

Source for standard benchmarks: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

About the author

Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

About the reviewer

Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop and evaluate LLM powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape, and is actively distilling his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley Office.

Last updated: Jul 19, 2024