- Largest context window: Claude 3 (200K), GPT-4 Turbo (128K), Gemini 1.5 Pro (128K)
- Lowest input cost per 1M tokens: Gemini Pro ($0.125), Mistral Tiny ($0.15), GPT-3.5 Turbo ($0.50)
| Model | MMLU (5-shot) |
| --- | --- |
| GPT-4 | 86.40% |
| Claude 3 Sonnet | 79.00% |
| Mistral Large | 81.20% |
| Gemini Ultra | 83.70% |
| Gemini 1.5 Pro | 81.90% |
| Claude 3 Opus | 86.80% |
| Model | GSM-8K |
| --- | --- |
| GPT-4 | 92% |
| Claude 3 Sonnet | 92.30% |
| Mistral Large | 81% |
| Gemini Ultra | 94.40% |
| Gemini 1.5 Pro | 91.70% |
| Claude 3 Opus | 95% |
| Model | HumanEval (0-shot) |
| --- | --- |
| GPT-4 | 67% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Claude 3 Opus | 84.90% |
Model Comparison

| Model | Average | MMLU (Multi-choice Qs) | HellaSwag (Reasoning) | HumanEval (Python coding) | BBHard (Future capabilities) | GSM-8K (Grade school math) | MATH (Math problems) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 79.45% | 86.40% | 95.30% | 67% | 83.10% | 92% | 52.90% |
| GPT-3.5 | 65.46% | 70% | 85.50% | 48.10% | 66.60% | 57.10% | 34.1% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75% | 77.90% | 32.60% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct 70B | 79.23% | 82% | 87% | 81.7% | 81.3% | 93% | 50.4% |
| Llama 3 Instruct 8B | - | 68.40% | - | 62% | 61% | 79.60% | 30% |
| Falcon 180B | 42.62% | 70.60% | 87.50% | 35.40% | 37.10% | 19.60% | 5.50% |
| Grok 1.5 | - | 73.00% | - | 63% | - | 62.90% | 23.90% |
| Qwen 14B | - | 66.30% | - | 32% | 53.40% | 61.30% | 24.80% |
| Gemma 7B | 50.60% | 64.30% | 81.2% | 32.3% | 55.10% | 46.40% | 24.30% |
| Llama 2 Chat 13B | 37.63% | 54.80% | 80.7% | 18.3% | 39.40% | 28.70% | 3.9% |
| Llama 2 Chat 7B | 30.84% | 45.30% | 77.22% | 12.8% | 32.6% | 14.6% | 2.5% |
| Mistral Large | - | 81.2% | 89.2% | 45.1% | - | 81% | 45% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
* We don't show an average for Grok 1.5, Mistral Large, Llama 3 Instruct 8B, and Qwen 14B because of missing benchmark data.
* This comparison focuses solely on MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, and MATH; other benchmarks are excluded because data for them is absent from the model reports.
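For readers who want to reproduce the table, here is a minimal Python sketch of how the Average column is derived: the unweighted mean of the six benchmark scores, omitted whenever any score is missing. The `scores` dictionary is an illustrative subset of the table above, not the full leaderboard.

```python
# Minimal sketch of the "Average" column: the unweighted mean of the six
# benchmark scores (MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, MATH).
# Scores below are an illustrative subset of the table above; None marks
# missing data, in which case no average is shown.
scores = {
    "GPT-4":         [86.40, 95.30, 67.00, 83.10, 92.00, 52.90],
    "Claude 3 Opus": [86.80, 95.40, 84.90, 86.80, 95.00, 60.10],
    "Grok 1.5":      [73.00, None, 63.00, None, 62.90, 23.90],
}

for model, values in scores.items():
    if any(v is None for v in values):
        print(f"{model}: -")  # average omitted when any benchmark is missing
    else:
        print(f"{model}: {sum(values) / len(values):.2f}%")
```

Running this prints 79.45% for GPT-4 and 84.83% for Claude 3 Opus, matching the Average column above, and "-" for Grok 1.5.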
Code Generation Comparison
Comparison of pre-trained proprietary and open-source models for code generation.

| Model | HumanEval (0-shot) |
| --- | --- |
| GPT-4 | 67% |
| GPT-3.5 | 48.10% |
| Claude 2 | 70% |
| PaLM 2-L | 37.60% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Llama 2 70B | 30.50% |
| Falcon 40B | 35.40% |
| Qwen 14B | 32.30% |
| Code Llama 34B | 48.80% |
| Unnatural Code Llama 34B | 62.20% |
| Gemma 7B | 32.3% |
| Mistral Large | 45.1% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
Cost and Context Window Comparison
Comparison of context window and cost per 1M tokens.

| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens |
| --- | --- | --- | --- |
| GPT-4 | 8K | $30.00 | $60.00 |
| GPT-4-32k | 32K | $60.00 | $120.00 |
| GPT-4 Turbo | 128K | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16K | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4K | $1.50 | $2.00 |
| Gemini Pro | 32K | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128K | $7.00 | $21.00 |
| Claude 2 | 100K | $8.00 | $24.00 |
| Claude 2.1 | 200K | $8.00 | $24.00 |
| Claude Instant | 100K | $0.80 | $2.40 |
| Mistral Small | 16K | $2.00 | $6.00 |
| Mistral Medium | 32K | $2.70 | $8.10 |
| Mistral Large | 32K | $8.00 | $24.00 |
| Claude 3 Opus | 200K | $15.00 | $75.00 |
| Claude 3 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3 Haiku | 200K | $0.25 | $1.25 |
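To translate these per-1M-token rates into per-request dollar figures, here is a small illustrative Python sketch. The prices are copied from the table above; the model subset and the token counts in the example are made-up values, not a recommendation.

```python
# Illustrative sketch: estimate the cost of a single request from the
# per-1M-token prices in the table above. Token counts are example values.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "GPT-4 Turbo":    (10.00, 30.00),
    "Gemini 1.5 Pro": (7.00, 21.00),
    "Claude 3 Opus":  (15.00, 75.00),
    "Claude 3 Haiku": (0.25, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: tokens * price, scaled from per-1M rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 10K-token prompt with a 1K-token completion.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.4f}")
```

For that example request, the spread is wide: roughly $0.13 on GPT-4 Turbo and $0.225 on Claude 3 Opus versus under half a cent on Claude 3 Haiku.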
Sources
This leaderboard compares the capabilities, price, and context window of leading commercial and open-source LLMs, based on the benchmark data published in each model's technical report. Last updated March 2024.