Largest context window: Gemini 1.5 Flash (1M), Claude models (200K), GPT-4o, GPT-4o mini, and GPT-4 Turbo (128K)
Lowest input cost per 1M tokens: Gemini Pro ($0.125), GPT-4o mini ($0.15), Claude 3 Haiku ($0.25)
MMLU (5-shot)

| Model | Score |
| --- | --- |
| GPT-4 | 86.40% |
| GPT-4o | 88.70% |
| GPT-4T 2024-04-09 | 86.50% |
| Gemini Ultra | 83.70% |
| Claude 3 Opus | 86.80% |
| Claude 3.5 Sonnet | 88.70% |
MATH

| Model | Score |
| --- | --- |
| GPT-4o mini | 70.20% |
| Gemini Flash | 67.70% |
| Claude 3 Opus | 60.10% |
| GPT-4T 2024-04-09 | 72.20% |
| GPT-4o | 76.60% |
| Claude 3.5 Sonnet | 71.10% |
HumanEval (0-shot)

| Model | Score |
| --- | --- |
| GPT-4T 2024-04-09 | 87.60% |
| Claude 3 Opus | 84.90% |
| GPT-4o mini | 87.20% |
| GPT-4o | 90.20% |
| Claude 3 Haiku | 75.90% |
| Claude 3.5 Sonnet | 92.00% |
Model Comparison

| Model | Average | MMLU (Multi-choice Qs) | HellaSwag (Reasoning) | HumanEval (Python coding) | BBHard (Future Capabilities) | GSM-8K (Grade school math) | MATH (Math Problems) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 79.45% | 86.40% | 95.30% | 67.00% | 83.10% | 92.00% | 52.90% |
| GPT-3.5 | 65.46% | 70.00% | 85.50% | 48.10% | 66.60% | 57.10% | 34.10% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75.00% | 77.90% | 32.60% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84.00% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct - 70B | 79.23% | 82.00% | 87.00% | 81.70% | 81.30% | 93.00% | 50.40% |
| Llama 3 Instruct - 8B | - | 68.40% | - | 62.00% | 61.00% | 79.60% | 30.00% |
| Grok 1.5 | - | 73.00% | - | 63.00% | - | 62.90% | 23.90% |
| Mistral Large | - | 81.20% | 89.20% | 45.10% | - | 81.00% | 45.00% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| GPT-4o | - | 88.70% | - | 90.20% | - | - | 76.60% |
| Gemini 1.5 Flash | - | 78.90% | - | - | 89.20% | - | 67.70% |
| GPT-4T 2024-04-09 | - | 86.50% | - | - | 87.60% | - | 72.20% |
| Nemotron 340B | - | 78.70% | - | 73.20% | - | 92.30% | - |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
| GPT-4o mini | - | 82.00% | - | 87.00% | - | - | 70.20% |
* For some models, no average is shown because of missing data.
* This comparison focuses solely on MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, and MATH; other benchmarks are excluded because they are absent from many model reports.
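The averaging rule stated above is simple: a model gets an average only when all six benchmark scores are present. A minimal sketch of that rule (the function name is my own; the scores are copied from the GPT-4 and GPT-4o rows):

```python
def benchmark_average(scores):
    """Average a model's benchmark scores, or None if any are missing."""
    if any(s is None for s in scores.values()):
        return None  # rule from the footnote: no average when data is missing
    return sum(scores.values()) / len(scores)

# GPT-4 row from the table (MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, MATH)
gpt4 = {"MMLU": 86.40, "HellaSwag": 95.30, "HumanEval": 67.00,
        "BBHard": 83.10, "GSM-8K": 92.00, "MATH": 52.90}
print(round(benchmark_average(gpt4), 2))  # 79.45, matching the Average column

# GPT-4o reports only MMLU, HumanEval, and MATH, so it gets no average
gpt4o = {"MMLU": 88.70, "HellaSwag": None, "HumanEval": 90.20,
         "BBHard": None, "GSM-8K": None, "MATH": 76.60}
print(benchmark_average(gpt4o))  # None
```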
Cost and Context Window Comparison

Comparison of context window and cost per 1M tokens.

| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens |
| --- | --- | --- | --- |
| GPT-4 | 8,000 | $30.00 | $60.00 |
| GPT-4-32k | 32,000 | $60.00 | $120.00 |
| GPT-4 Turbo | 128,000 | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16,000 | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4,000 | $1.50 | $2.00 |
| Gemini Pro | 32,000 | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128,000 | $7.00 | $21.00 |
| Mistral Small | 16,000 | $2.00 | $6.00 |
| Mistral Medium | 32,000 | $2.70 | $8.10 |
| Mistral Large | 32,000 | $8.00 | $24.00 |
| Claude 3 Opus | 200,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | 200,000 | $3.00 | $15.00 |
| Claude 3 Haiku | 200,000 | $0.25 | $1.25 |
| GPT-4o | 128,000 | $5.00 | $15.00 |
| Gemini 1.5 Flash | 1,000,000 | $0.35 | $0.70 |
| Nemotron | 4,000 | - | - |
| Llama 3 Models | 8,000 | - | - |
| Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00 |
| GPT-4o mini | 128,000 | $0.15 | $0.60 |
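Per-token pricing makes request costs straightforward to estimate: multiply the input and output token counts by the respective per-1M rates. A quick sketch (the function name is my own) using the GPT-4o mini rates from the table above:

```python
def request_cost(input_tokens, output_tokens,
                 input_rate_per_m, output_rate_per_m):
    """Dollar cost of one request, given per-1M-token rates."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# GPT-4o mini: $0.15 input / $0.60 output per 1M tokens
cost = request_cost(10_000, 2_000, 0.15, 0.60)
print(f"${cost:.4f}")  # $0.0027
```

Note the input/output asymmetry: for most models in the table, output tokens cost 3-5x more than input tokens, so long completions dominate the bill even when prompts are short.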
HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation.

| Model | HumanEval (0-shot) |
| --- | --- |
| GPT-4 | 67.00% |
| GPT-4 Turbo | 87.10% |
| GPT-4o | 90.20% |
| GPT-3.5 | 48.10% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Mistral Large | 45.10% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
| Llama 3 70B | 75.90% |
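The HumanEval scores above are pass@1: the fraction of the benchmark's 164 programming problems for which a generated solution passes the unit tests. More generally, when a model produces n samples per problem of which c pass, pass@k is computed with the standard unbiased estimator 1 - C(n-c, k)/C(n, k). A minimal sketch, not tied to any particular model above:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples per problem, 50 correct:
print(round(pass_at_k(200, 50, 1), 4))   # 0.25 -- plain accuracy for k=1
print(round(pass_at_k(200, 50, 10), 4))  # much higher with 10 attempts
```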
Sources
This leaderboard compares the capabilities, pricing, and context windows of leading commercial and open-source LLMs, based on the benchmark data reported in each model's technical report. Updated March 2024.