
Updated: 07 Aug 2025

Coding LLM Leaderboard

This leaderboard ranks the best LLMs for writing and editing code (released after April 2024). Data comes from model providers, open-source contributors, and Vellum's own evaluations. Want to see how these models handle your own repos or workflows? Try Vellum Evals.

Top models across programming benchmarks
Best in Live CodeBench (score, %)
Grok 3 [Beta]: 79.4
Grok 4: 79
OpenAI o3-mini: 74.1
GPT oss 20b: 69
GPT oss 120b: 69
Best in Aider Polyglot (score, %)
GPT-5: 88
Gemini 2.5 Pro: 82.2
OpenAI o3: 81.3
OpenAI o4-mini: 68.9
Claude 3.7 Sonnet [R]: 64.9
Best in Agentic Coding (SWE Bench) (score, %)
Grok 4: 75
GPT-5: 74.9
Claude Opus 4.1: 74.5
Claude 4 Sonnet: 72.7
Claude 4 Opus: 72.5
Independent evals
Best in Tool Use (BFCL) (score, %)
Llama 3.1 405b: 81.1
Llama 3.3 70b: 77.3
GPT-4o: 72.08
GPT-4.5: 69.94
Nova Pro: 68.4
Best in Adaptive Reasoning (GRIND) (score, %)
Gemini 2.5 Pro: 82.1
Claude 4 Sonnet: 75
Claude 4 Opus: 67.9
Claude 3.7 Sonnet [R]: 60.7
Nemotron Ultra 253B: 57.1
Best Overall (Humanity's Last Exam) (score, %)
GPT-5: 42
Grok 4: 25.4
Gemini 2.5 Pro: 21.6
OpenAI o3: 20.32
GPT oss 120b: 14.9
CODING BENCHMARKS
Model Comparison
Model | Context window | LiveCodeBench (%) | SWE-Bench (%) | MATH (%) | BFCL (%) | Aider Polyglot (%)
GPT-5 | 400,000 | n/a | 74.9 | n/a | n/a | 88
Claude Opus 4.1 | 200,000 | n/a | 74.5 | n/a | n/a | n/a
GPT oss 20b | 131,072 | 69 | n/a | n/a | n/a | n/a
GPT oss 120b | 131,072 | 69 | n/a | n/a | n/a | n/a
Grok 4 | 256,000 | 79 | 75 | n/a | n/a | n/a
Claude 4 Opus | 200,000 | n/a | 72.5 | n/a | n/a | n/a
Claude 4 Sonnet | 200,000 | n/a | 72.7 | n/a | n/a | n/a
Gemini 2.5 Flash | 1,000,000 | 63.5 | n/a | n/a | n/a | 51.1
OpenAI o3 | 200,000 | n/a | 69.1 | n/a | n/a | 81.3
OpenAI o4-mini | 200,000 | n/a | 68.1 | n/a | n/a | 68.9
Nemotron Ultra 253B | n/a | 64 | n/a | n/a | n/a | n/a
GPT-4.1 nano | 1,000,000 | n/a | n/a | n/a | n/a | 9.8
GPT-4.1 mini | 1,000,000 | n/a | 23.6 | n/a | n/a | 34.7
GPT-4.1 | 1,000,000 | 52 | 55 | n/a | n/a | n/a
Llama 4 Behemoth | n/a | 49.4 | n/a | 95 | n/a | n/a
Llama 4 Scout | 10,000,000 | 32.8 | n/a | n/a | n/a | n/a
Llama 4 Maverick | 10,000,000 | 41 | n/a | n/a | n/a | 15.6
Gemma 3 27b | 128,000 | n/a | 10.2 | 89 | 59.11 | 4.9
Grok 3 [Beta] | n/a | 79.4 | n/a | n/a | n/a | n/a
Gemini 2.5 Pro | 1,000,000 | 69 | 59.6 | n/a | n/a | 82.2
Claude 3.7 Sonnet | 200,000 | n/a | 62.3 | 82.2 | 58.3 | 60.4
GPT-4.5 | 128,000 | n/a | 38 | n/a | 69.94 | 44.9
Claude 3.7 Sonnet [R] | 200,000 | n/a | 70.3 | 96.2 | 58.3 | 64.9
DeepSeek-R1 | 128,000 | 64.3 | 49.2 | 97.3 | 57.53 | 64
OpenAI o3-mini | 200,000 | 74.1 | 61 | 97.9 | 65.12 | 60.4
OpenAI o1-mini | 128,000 | n/a | n/a | 90 | 52.2 | 32.9
Qwen2.5-VL-32B | 131,000 | n/a | 18.8 | 82.2 | 62.79 | 62.84
DeepSeek V3 0324 | 128,000 | 41 | 38.8 | 94 | 58.55 | 55.1
OpenAI o1 | 200,000 | 59.5 | 48.9 | 96.4 | 67.87 | 61.7
Gemini 2.0 Flash | 1,000,000 | n/a | 51.8 | 89.7 | 60.42 | 22.2
Llama 3.3 70b | 128,000 | n/a | n/a | 77 | 77.3 | 51.43
Nova Pro | 300,000 | n/a | n/a | 76.6 | 68.4 | 61.38
Claude 3.5 Haiku | 200,000 | n/a | 40.6 | 69.4 | 54.31 | 28
Llama 3.1 405b | 128,000 | n/a | n/a | 73.8 | 81.1 | n/a
GPT-4o mini | 128,000 | n/a | n/a | 70.2 | 64.1 | 3.6
GPT-4o | 128,000 | n/a | 31 | 60.3 | 72.08 | 27.1
Claude 3.5 Sonnet | 200,000 | n/a | 49 | 78 | 56.46 | 51.6
* This comparison view focuses on LiveCodeBench, SWE-Bench, MATH, BFCL, and Aider Polyglot; cells marked n/a reflect benchmarks for which the model reports provide no data.
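If you want to slice this comparison programmatically, here is a minimal sketch in Python. The dictionary hard-codes a few rows from the table above (None stands in for n/a), and the `rank_by` helper is purely illustrative, not a Vellum API.

```python
# Minimal sketch: rank a handful of rows from the table above by one benchmark.
SCORES = {
    "GPT-5":           {"live_codebench": None, "swe_bench": 74.9, "aider_polyglot": 88.0},
    "Grok 4":          {"live_codebench": 79.0, "swe_bench": 75.0, "aider_polyglot": None},
    "Claude Opus 4.1": {"live_codebench": None, "swe_bench": 74.5, "aider_polyglot": None},
    "Gemini 2.5 Pro":  {"live_codebench": 69.0, "swe_bench": 59.6, "aider_polyglot": 82.2},
    "OpenAI o3":       {"live_codebench": None, "swe_bench": 69.1, "aider_polyglot": 81.3},
}

def rank_by(benchmark: str):
    """Return (model, score) pairs sorted high-to-low, skipping models with no score."""
    scored = [(m, b[benchmark]) for m, b in SCORES.items() if b.get(benchmark) is not None]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for model, score in rank_by("aider_polyglot"):
        print(f"{model}: {score}%")
```

Swapping the benchmark key reproduces the per-benchmark rankings shown in the charts above.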
MODEL COMPARISON
Context window, cost and speed comparison
Model | Context window | Input price ($/1M tokens) | Output price ($/1M tokens) | Throughput (tokens/s) | Latency (s)
GPT-5 | 400,000 | 1.25 | 10 | n/a | n/a
Claude Opus 4.1 | 200,000 | 15 | 75 | n/a | n/a
GPT oss 20b | 131,072 | 0.08 | 0.35 | 564 | 4
GPT oss 120b | 131,072 | 0.15 | 0.6 | 260 | 8.1
Grok 4 | 256,000 | n/a | n/a | 52 | 13.3
Claude 4 Opus | 200,000 | 15 | 75 | n/a | 1.95
Claude 4 Sonnet | 200,000 | 3 | 15 | n/a | 1.9
Gemini 2.5 Flash | 1,000,000 | 0.15 | 0.6 | 200 | 0.35
OpenAI o3 | 200,000 | 10 | 40 | 94 | 28
OpenAI o4-mini | 200,000 | 1.1 | 4.4 | 135 | 35.3
GPT-4.1 nano | 1,000,000 | 0.1 | 0.4 | n/a | n/a
GPT-4.1 mini | 1,000,000 | 0.4 | 1.6 | n/a | n/a
GPT-4.1 | 1,000,000 | 2 | 8 | n/a | n/a
Llama 4 Scout | 10,000,000 | 0.11 | 0.34 | 2,600 | 0.33
Llama 4 Maverick | 10,000,000 | 0.2 | 0.6 | 126 | 0.45
Gemma 3 27b | 128,000 | 0.07 | 0.07 | 59 | 0.72
Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a
Gemini 2.5 Pro | 1,000,000 | 1.25 | 10 | 191 | 30
Claude 3.7 Sonnet | 200,000 | 3 | 15 | 78 | 0.91
GPT-4.5 | 128,000 | 75 | 150 | 48 | 1.25
Claude 3.7 Sonnet [R] | 200,000 | 3 | 15 | 78 | 0.95
DeepSeek-R1 | 128,000 | 0.55 | 2.19 | 24 | 4
OpenAI o3-mini | 200,000 | 1.1 | 4.4 | 214 | 14
OpenAI o1-mini | 128,000 | 3 | 12 | 220 | 11.43
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
DeepSeek V3 0324 | 128,000 | 0.27 | 1.1 | 33 | 4
OpenAI o1 | 200,000 | 15 | 60 | 100 | 30
Gemini 2.0 Flash | 1,000,000 | 0.1 | 0.4 | 257 | 0.34
Llama 3.3 70b | 128,000 | 0.59 | 0.7 | 2,500 | 0.52
Nova Pro | 300,000 | 1 | 4 | 128 | 0.64
Claude 3.5 Haiku | 200,000 | 0.8 | 4 | 66 | 0.88
Llama 3.1 405b | 128,000 | 3.5 | 3.5 | 969 | 0.73
GPT-4o mini | 128,000 | 0.15 | 0.6 | 65 | 0.35
GPT-4o | 128,000 | 2.5 | 10 | 143 | 0.51
Claude 3.5 Sonnet | 200,000 | 3 | 15 | 78 | 1.22
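To make the pricing and speed columns concrete, here is a small sketch that estimates the cost and rough wall-clock time of a single request. It assumes the listed prices are USD per one million tokens, that throughput applies to output tokens, and that latency behaves like time-to-first-token; the request sizes are hypothetical.

```python
# Minimal sketch: estimate per-request cost and rough generation time from the table above.
# Assumptions: prices are USD per 1M tokens; latency is time-to-first-token;
# throughput (tokens/s) governs output generation.
def estimate(input_price, output_price, throughput_tps, latency_s,
             input_tokens=2_000, output_tokens=500):
    cost = (input_tokens / 1_000_000) * input_price + (output_tokens / 1_000_000) * output_price
    gen_time = latency_s + output_tokens / throughput_tps
    return cost, gen_time

# Values taken from the table: GPT oss 120b ($0.15 / $0.60, 260 t/s, 8.1 s)
# and Gemini 2.0 Flash ($0.10 / $0.40, 257 t/s, 0.34 s).
for name, row in {
    "GPT oss 120b":     (0.15, 0.60, 260, 8.1),
    "Gemini 2.0 Flash": (0.10, 0.40, 257, 0.34),
}.items():
    cost, seconds = estimate(*row)
    print(f"{name}: ~${cost:.4f} per request, ~{seconds:.1f}s end to end")
```

Under these assumptions, a cheaper and faster model is not always the better pick; weigh the benchmark tables above against the per-request economics for your workload.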