Updated 07 Aug 2025
Coding LLM Leaderboard
This leaderboard shows the best LLMs for writing and editing code (released after April 2024). Data comes from model providers, open-source contributors, and Vellum's own evaluations. Want to see how these models handle your own repos or workflows? Try Vellum Evals.
Top models across programming benchmarks
Best in LiveCodeBench
Score (%)
Grok 3 [Beta]: 79.4
Grok 4: 79
OpenAI o3-mini: 74.1
GPT oss 20b: 69
GPT oss 120b: 69
Best in Aider Polyglot
Score (%)
GPT-5: 88
Gemini 2.5 Pro: 82.2
OpenAI o3: 81.3
OpenAI o4-mini: 68.9
Claude 3.7 Sonnet [R]: 64.9
Best in Agentic Coding (SWE Bench)
Score (%)
Grok 4: 75
GPT-5: 74.9
Claude Opus 4.1: 74.5
Claude 4 Sonnet: 72.7
Claude 4 Opus: 72.5
Independent evals
Best in Tool Use (BFCL)
Score (%)
Llama 3.1 405b: 81.1
Llama 3.3 70b: 77.3
GPT-4o: 72.08
GPT-4.5: 69.94
Nova Pro: 68.4
Best in Adaptive Reasoning (GRIND)
Score (%)
Gemini 2.5 Pro: 82.1
Claude 4 Sonnet: 75
Claude 4 Opus: 67.9
Claude 3.7 Sonnet [R]: 60.7
Nemotron Ultra 253B: 57.1
Best Overall (Humanity's Last Exam)
Score (%)
GPT-5: 42
Grok 4: 25.4
Gemini 2.5 Pro: 21.6
OpenAI o3: 20.32
GPT oss 120b: 14.9