
LLM Leaderboard

Largest context window: Gemini 1.5 Pro (2M), Gemini 1.5 Flash (1M), Claude models (200K), GPT-4o, GPT-4o mini, and GPT-4 Turbo (128K)
Lowest input cost per 1M tokens: Llama 3.1 8b ($0.05), Gemini 1.5 Flash ($0.075), Gemini Pro ($0.125), GPT-4o mini ($0.15), Claude 3 Haiku ($0.25)
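Context windows and prices are both denominated in tokens, so a quick local token count tells you whether a prompt fits a given model. A minimal sketch using the tiktoken library (the cl100k_base encoding roughly matches OpenAI chat models; other vendors use different tokenizers, so treat the count as an estimate):

```python
# pip install tiktoken
import tiktoken

def fits_in_context(prompt: str, context_window: int, reserved_output: int = 4096) -> bool:
    """Rough check that a prompt plus a reserved output budget fits a model's context window."""
    # cl100k_base approximates OpenAI chat-model tokenization; other vendors differ.
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(prompt)) + reserved_output <= context_window

print(fits_in_context("Summarize the attached report.", context_window=128_000))  # True
```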
Top Models per Task
- Best in Multitask Reasoning (MMLU): top 5 MMLU models
- Best in Coding (HumanEval): top 5 HumanEval models
- Best in Math (MATH)

Fastest and Most Affordable Models
- Lowest latency (TTFT)
- Cheapest models (price comparison)
Compare Models
| Model | Context size | Cutoff date | Input / Output cost (per 1M tokens) | Max output tokens | Latency (TTFT) | Throughput |
|---|---|---|---|---|---|---|
| Llama 3.1 405b | 128,000 | Dec 2023 | $2.70 / $2.70 | 4,096 | 0.59s | 27 t/s |
| Llama 3.1 70b | 128,000 | Dec 2023 | $0.52 / $0.75 | 4,096 | 0.43s | 450 t/s (Cerebras) |
| Llama 3.1 8b | 128,000 | Dec 2023 | $0.05 / $0.08 | 4,096 | 0.32s | ~1,800 t/s (Cerebras) |
| Gemini 1.5 Flash | 1,000,000 | May 2024 | $0.075 / $0.30 | 4,096 | 1.06s | 166 t/s |
| Gemini 1.5 Pro | 2,000,000 | May 2024 | $3.50 / $10.50 | 4,096 | 1.12s | 61 t/s |
| GPT-3.5 Turbo | 16,400 | Sept 2023 | $0.50 / $1.50 | 4,096 | 0.37s | 84 t/s |
| GPT-4o mini | 128,000 | Oct 2023 | $0.15 / $0.60 | 4,096 | 0.56s | 97 t/s |
| GPT-4 Turbo | 128,000 | Dec 2023 | $10.00 / $30.00 | 4,096 | 0.6s | 28 t/s |
| GPT-4o | 128,000 | Oct 2023 | $5.00 / $15.00 | 4,096 | 0.48s | 79 t/s |
| Claude 3 Haiku | 200,000 | Apr 2024 | $0.25 / $1.25 | 4,096 | 0.55s | 133 t/s |
| Claude 3.5 Sonnet | 200,000 | Apr 2024 | $3.00 / $15.00 | 4,096 | 1.22s | 78 t/s |
| Claude 3 Opus | 200,000 | Aug 2023 | $15.00 / $75.00 | 4,096 | 1.99s | 25 t/s |
| GPT-4 | 8,192 | Dec 2023 | $30.00 / $60.00 | 4,096 | 0.59s | 125 t/s |
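Latency (TTFT, time to first token) and throughput figures like those above can be reproduced against any streaming chat API. A minimal sketch using the OpenAI Python SDK (the model name is just an example, and counting one token per streamed content chunk is an approximation; vendors batch chunks differently):

```python
# pip install openai
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model: str, prompt: str) -> tuple[float, float]:
    """Return (TTFT in seconds, approximate throughput in tokens/s) for one streamed completion."""
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # time to first token
            chunks += 1  # roughly one token per content chunk
    if first is None:
        raise RuntimeError("stream returned no content")
    elapsed = time.perf_counter() - first
    return first - start, chunks / elapsed if elapsed > 0 else float("inf")

ttft, tps = measure("gpt-4o-mini", "Write one sentence about leaderboards.")
print(f"TTFT: {ttft:.2f}s, throughput: ~{tps:.0f} t/s")
```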
Standard Benchmarks

Model Comparison

| Model | Average | MMLU (Multi-choice Qs) | HellaSwag (Reasoning) | HumanEval (Python coding) | BBHard (Future capabilities) | GSM-8K (Grade school math) | MATH (Math problems) |
|---|---|---|---|---|---|---|---|
| GPT-4 | 79.45% | 86.40% | 95.30% | 67.00% | 83.10% | 92.00% | 52.90% |
| GPT-4o | - | 88.70% | - | 90.20% | - | - | 76.60% |
| GPT-4o mini | - | 82.00% | - | 87.00% | - | - | 70.20% |
| GPT-3.5 | 65.46% | 70.00% | 85.50% | 48.10% | 66.60% | 57.10% | 34.10% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84.00% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct 70B | 79.23% | 82.00% | 87.00% | 81.70% | 81.30% | 93.00% | 50.40% |
| Llama 3 Instruct 8B | - | 68.40% | - | 62.00% | 61.00% | 79.60% | 30.00% |
| Grok 1.5 | - | 73.00% | - | 63.00% | - | 62.90% | 23.90% |
| Mistral Large | - | 81.20% | 89.20% | 45.10% | - | 81.00% | 45.00% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| Gemini 1.5 Flash | - | 78.90% | - | - | 89.20% | - | 67.70% |
| GPT-4T 2024-04-09 | - | 86.50% | - | - | 87.60% | - | 72.20% |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
| OpenAI o1 | - | 92.30% | - | 92.40% | - | - | 94.80% |
| OpenAI o1-mini | - | 85.20% | - | 92.40% | - | - | 90.00% |
* Average values are omitted for models with missing benchmark data.
* This comparison covers only MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, and MATH; other benchmarks are excluded because the model reports do not publish them consistently.
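The Average column is the simple mean of the six benchmark scores, which is why it disappears whenever one of them is missing; for GPT-4, (86.40 + 95.30 + 67.00 + 83.10 + 92.00 + 52.90) / 6 = 79.45. A minimal sketch of that calculation:

```python
def average(scores: dict[str, float | None]) -> float | None:
    """Simple mean across the six benchmarks; None (shown as '-') if any score is missing."""
    values = list(scores.values())
    if any(v is None for v in values):
        return None
    return round(sum(values) / len(values), 2)

gpt4 = {"MMLU": 86.40, "HellaSwag": 95.30, "HumanEval": 67.00,
        "BBHard": 83.10, "GSM-8K": 92.00, "MATH": 52.90}
print(average(gpt4))  # 79.45, matching the Average column above
```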

Download the Latest Leaderboard Data


Cost and Context Window Comparison

Comparison of context window and cost per 1M tokens.
| Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens |
|---|---|---|---|
| GPT-4 | 8,000 | $30.00 | $60.00 |
| GPT-4-32k | 32,000 | $60.00 | $120.00 |
| GPT-4 Turbo | 128,000 | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16,000 | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4,000 | $1.50 | $2.00 |
| Gemini Pro | 32,000 | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128,000 | $7.00 | $21.00 |
| Mistral Small | 16,000 | $2.00 | $6.00 |
| Mistral Medium | 32,000 | $2.70 | $8.10 |
| Mistral Large | 32,000 | $8.00 | $24.00 |
| Claude 3 Opus | 200,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | 200,000 | $3.00 | $15.00 |
| Claude 3 Haiku | 200,000 | $0.25 | $1.25 |
| GPT-4o | 128,000 | $5.00 | $15.00 |
| Gemini 1.5 Flash | 1,000,000 | $0.35 | $0.70 |
| Nemotron | 4,000 | - | - |
| Llama 3 models | 8,000 | - | - |
| Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00 |
| GPT-4o mini | 128,000 | $0.15 | $0.60 |
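These per-1M-token rates make per-request cost a one-line calculation: cost = input_tokens / 1M × input price + output_tokens / 1M × output price. A minimal sketch, with prices taken from the table above:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost in dollars for one request, given per-1M-token prices."""
    return input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price

# Example: a 10K-token prompt with a 1K-token reply on GPT-4o mini ($0.15 / $0.60 per 1M)
print(f"${request_cost(10_000, 1_000, 0.15, 0.60):.4f}")  # $0.0021
```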

HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation.
| Model | HumanEval (0-shot) |
|---|---|
| GPT-4 | 67.00% |
| GPT-4 Turbo | 87.10% |
| GPT-4o | 90.20% |
| GPT-3.5 | 48.10% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Mistral Large | 45.10% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
| Llama 3 70B | 75.90% |
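HumanEval scores report the fraction of its 164 programming problems whose generated solution passes the unit tests (pass@1 here, with 0-shot prompting). When n samples are drawn per problem, scores are conventionally computed with the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn for a problem and c of them passed the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=9, k=1))  # 0.45; for k=1 this reduces to c/n
```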

Sources

This leaderboard compares the capabilities, pricing, and context windows of leading commercial and open-source LLMs, based on the benchmark data published in the models' technical reports. Updated March 2024.

Technical reports

Pricing info