
Updated: 07 Aug 2025

Coding LLM Leaderboard

This leaderboard ranks the best LLMs for writing and editing code (released after April 2024). Data comes from model providers, open-source contributors, and Vellum's own evaluations. Want to see how these models handle your own repos or workflows? Try Vellum Evals.

Top models across programming benchmarks
Best in Live CodeBench (score, %)
Grok 3 [Beta]: 79.4
Grok 4: 79
OpenAI o3-mini: 74.1
GPT oss 20b: 69
GPT oss 120b: 69
Best in Aider Polyglot (score, %)
GPT-5: 88
Gemini 2.5 Pro: 82.2
OpenAI o3: 81.3
OpenAI o4-mini: 68.9
Claude 3.7 Sonnet [R]: 64.9
Best in Agentic Coding (SWE Bench) (score, %)
Grok 4: 75
GPT-5: 74.9
Claude Opus 4.1: 74.5
Claude 4 Sonnet: 72.7
Claude 4 Opus: 72.5
Independent evals
Best in Tool Use (BFCL) (score, %)
Llama 3.1 405b: 81.1
Llama 3.3 70b: 77.3
GPT-4o: 72.08
GPT-4.5: 69.94
Nova Pro: 68.4
Best in Adaptive Reasoning (GRIND) (score, %)
Gemini 2.5 Pro: 82.1
Claude 4 Sonnet: 75
Claude 4 Opus: 67.9
Claude 3.7 Sonnet [R]: 60.7
Nemotron Ultra 253B: 57.1
Best Overall (Humanity's Last Exam) (score, %)
GPT-5: 42
Grok 4: 25.4
Gemini 2.5 Pro: 21.6
OpenAI o3: 20.32
GPT oss 120b: 14.9
CODING BENCHMARKS
Model Comparison
Model | Context window | LiveCodeBench (%) | SWE-Bench (%) | MATH (%) | BFCL (%) | Aider Polyglot (%)
GPT-5 | 400,000 | n/a | 74.9 | n/a | n/a | 88
Claude Opus 4.1 | 200,000 | n/a | 74.5 | n/a | n/a | n/a
GPT oss 20b | 131,072 | 69 | n/a | n/a | n/a | n/a
GPT oss 120b | 131,072 | 69 | n/a | n/a | n/a | n/a
Grok 4 | 256,000 | 79 | 75 | n/a | n/a | n/a
Claude 4 Opus | 200,000 | n/a | 72.5 | n/a | n/a | n/a
Claude 4 Sonnet | 200,000 | n/a | 72.7 | n/a | n/a | n/a
Gemini 2.5 Flash | 1,000,000 | 63.5 | n/a | n/a | n/a | 51.1
OpenAI o3 | 200,000 | n/a | 69.1 | n/a | n/a | 81.3
OpenAI o4-mini | 200,000 | n/a | 68.1 | n/a | n/a | 68.9
Nemotron Ultra 253B | n/a | 64 | n/a | n/a | n/a | n/a
GPT-4.1 nano | 1,000,000 | n/a | n/a | n/a | n/a | 9.8
GPT-4.1 mini | 1,000,000 | n/a | 23.6 | n/a | n/a | 34.7
GPT-4.1 | 1,000,000 | 52 | 55 | n/a | n/a | n/a
Llama 4 Behemoth | n/a | 49.4 | n/a | 95 | n/a | n/a
Llama 4 Scout | 10,000,000 | 32.8 | n/a | n/a | n/a | n/a
Llama 4 Maverick | 10,000,000 | 41 | n/a | n/a | n/a | 15.6
Gemma 3 27b | 128,000 | n/a | 10.2 | 89 | 59.11 | 4.9
Grok 3 [Beta] | n/a | 79.4 | n/a | n/a | n/a | n/a
Gemini 2.5 Pro | 1,000,000 | 69 | 59.6 | n/a | n/a | 82.2
Claude 3.7 Sonnet | 200,000 | n/a | 62.3 | 82.2 | 58.3 | 60.4
GPT-4.5 | 128,000 | n/a | 38 | n/a | 69.94 | 44.9
Claude 3.7 Sonnet [R] | 200,000 | n/a | 70.3 | 96.2 | 58.3 | 64.9
DeepSeek-R1 | 128,000 | 64.3 | 49.2 | 97.3 | 57.53 | 64
OpenAI o3-mini | 200,000 | 74.1 | 61 | 97.9 | 65.12 | 60.4
OpenAI o1-mini | 128,000 | n/a | n/a | 90 | 52.2 | 32.9
Qwen2.5-VL-32B | 131,000 | n/a | 18.8 | 82.2 | 62.79 | 62.84
DeepSeek V3 0324 | 128,000 | 41 | 38.8 | 94 | 58.55 | 55.1
OpenAI o1 | 200,000 | 59.5 | 48.9 | 96.4 | 67.87 | 61.7
Gemini 2.0 Flash | 1,000,000 | n/a | 51.8 | 89.7 | 60.42 | 22.2
Llama 3.3 70b | 128,000 | n/a | n/a | 77 | 77.3 | 51.43
Nova Pro | 300,000 | n/a | n/a | 76.6 | 68.4 | 61.38
Claude 3.5 Haiku | 200,000 | n/a | 40.6 | 69.4 | 54.31 | 28
Llama 3.1 405b | 128,000 | n/a | n/a | 73.8 | 81.1 | n/a
GPT-4o mini | 128,000 | n/a | n/a | 70.2 | 64.1 | 3.6
GPT-4o | 128,000 | n/a | 31 | 60.3 | 72.08 | 27.1
Claude 3.5 Sonnet | 200,000 | n/a | 49 | 78 | 56.46 | 51.6
* This comparison view focuses on LiveCodeBench, SWE-Bench, MATH, BFCL, and Aider Polyglot; cells marked n/a reflect benchmarks for which the model reports provide no data.
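If you want to slice this comparison programmatically, here is a minimal sketch in Python. The dictionary hard-codes a few rows from the table above (None stands in for n/a), and the `rank_by` helper is purely illustrative, not a Vellum API.

```python
# Minimal sketch: rank a handful of rows from the table above by one benchmark.
SCORES = {
    "GPT-5":           {"live_codebench": None, "swe_bench": 74.9, "aider_polyglot": 88.0},
    "Grok 4":          {"live_codebench": 79.0, "swe_bench": 75.0, "aider_polyglot": None},
    "Claude Opus 4.1": {"live_codebench": None, "swe_bench": 74.5, "aider_polyglot": None},
    "Gemini 2.5 Pro":  {"live_codebench": 69.0, "swe_bench": 59.6, "aider_polyglot": 82.2},
    "OpenAI o3":       {"live_codebench": None, "swe_bench": 69.1, "aider_polyglot": 81.3},
}

def rank_by(benchmark: str):
    """Return (model, score) pairs sorted high-to-low, skipping models with no score."""
    scored = [(m, b[benchmark]) for m, b in SCORES.items() if b.get(benchmark) is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for model, score in rank_by("aider_polyglot"):
        print(f"{model}: {score}%")
```

Swapping the benchmark key reproduces the per-benchmark rankings shown in the charts above.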
MODEL COMPARISON
Context window, cost and speed comparison
Model | Context window | Input price ($/1M tokens) | Output price ($/1M tokens) | Throughput (tokens/s) | Latency (s)
GPT-5 | 400,000 | 1.25 | 10 | n/a | n/a
Claude Opus 4.1 | 200,000 | 15 | 75 | n/a | n/a
GPT oss 20b | 131,072 | 0.08 | 0.35 | 564 | 4
GPT oss 120b | 131,072 | 0.15 | 0.6 | 260 | 8.1
Grok 4 | 256,000 | n/a | n/a | 52 | 13.3
Claude 4 Opus | 200,000 | 15 | 75 | n/a | 1.95
Claude 4 Sonnet | 200,000 | 3 | 15 | n/a | 1.9
Gemini 2.5 Flash | 1,000,000 | 0.15 | 0.6 | 200 | 0.35
OpenAI o3 | 200,000 | 10 | 40 | 94 | 28
OpenAI o4-mini | 200,000 | 1.1 | 4.4 | 135 | 35.3
GPT-4.1 nano | 1,000,000 | 0.1 | 0.4 | n/a | n/a
GPT-4.1 mini | 1,000,000 | 0.4 | 1.6 | n/a | n/a
GPT-4.1 | 1,000,000 | 2 | 8 | n/a | n/a
Llama 4 Scout | 10,000,000 | 0.11 | 0.34 | 2,600 | 0.33
Llama 4 Maverick | 10,000,000 | 0.2 | 0.6 | 126 | 0.45
Gemma 3 27b | 128,000 | 0.07 | 0.07 | 59 | 0.72
Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a
Gemini 2.5 Pro | 1,000,000 | 1.25 | 10 | 191 | 30
Claude 3.7 Sonnet | 200,000 | 3 | 15 | 78 | 0.91
GPT-4.5 | 128,000 | 75 | 150 | 48 | 1.25
Claude 3.7 Sonnet [R] | 200,000 | 3 | 15 | 78 | 0.95
DeepSeek-R1 | 128,000 | 0.55 | 2.19 | 24 | 4
OpenAI o3-mini | 200,000 | 1.1 | 4.4 | 214 | 14
OpenAI o1-mini | 128,000 | 3 | 12 | 220 | 11.43
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
DeepSeek V3 0324 | 128,000 | 0.27 | 1.1 | 33 | 4
OpenAI o1 | 200,000 | 15 | 60 | 100 | 30
Gemini 2.0 Flash | 1,000,000 | 0.1 | 0.4 | 257 | 0.34
Llama 3.3 70b | 128,000 | 0.59 | 0.7 | 2,500 | 0.52
Nova Pro | 300,000 | 1 | 4 | 128 | 0.64
Claude 3.5 Haiku | 200,000 | 0.8 | 4 | 66 | 0.88
Llama 3.1 405b | 128,000 | 3.5 | 3.5 | 969 | 0.73
GPT-4o mini | 128,000 | 0.15 | 0.6 | 65 | 0.35
GPT-4o | 128,000 | 2.5 | 10 | 143 | 0.51
Claude 3.5 Sonnet | 200,000 | 3 | 15 | 78 | 1.22
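To make the pricing and speed columns concrete, here is a small sketch that estimates the cost and rough wall-clock time of a single request. It assumes the listed prices are USD per one million tokens, that throughput applies to output tokens, and that latency behaves like time-to-first-token; the request sizes are hypothetical.

```python
# Minimal sketch: estimate per-request cost and rough generation time from the table above.
# Assumptions: prices are USD per 1M tokens; latency is time-to-first-token;
# throughput (tokens/s) governs output generation.
def estimate(input_price, output_price, throughput_tps, latency_s,
             input_tokens=2_000, output_tokens=500):
    cost = (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price
    gen_time = latency_s + output_tokens / throughput_tps
    return cost, gen_time

# Values taken from the table: GPT oss 120b ($0.15 / $0.60, 260 t/s, 8.1 s)
# and Gemini 2.0 Flash ($0.10 / $0.40, 257 t/s, 0.34 s).
for name, row in {
    "GPT oss 120b":     (0.15, 0.60, 260, 8.1),
    "Gemini 2.0 Flash": (0.10, 0.40, 257, 0.34),
}.items():
    cost, seconds = estimate(*row)
    print(f"{name}: ~${cost:.4f} per request, ~{seconds:.1f}s end to end")
```

Under these assumptions, a cheaper and faster model is not always the better pick; weigh the benchmark tables above against the per-request economics for your workload.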