updated 29 May 2026

LLM Leaderboard

This LLM leaderboard displays the latest public benchmark performance for state-of-the-art (SOTA) model versions released after April 2024. The data comes from model providers and from evaluations run independently by Vellum or the open-source community. We feature results only from non-saturated benchmarks and exclude outdated ones (e.g. MMLU).

Top models per task

Best in Reasoning (GPQA Diamond)

Model | Score
Claude 3 Opus | 95.4%
Claude Opus 4.7 | 94.2%
Claude Opus 4.8 | 93.6%
GPT-5.5 | 93.6%
GPT 5.2 | 92.4%

Best in High School Math (AIME 2025)

Model | Score
Gemini 3 Pro | 100%
GPT 5.2 | 100%
Claude Opus 4.6 | 99.8%
Kimi K2 Thinking | 99.1%
GPT oss 20b | 98.7%

Best in Agentic Coding (SWE Bench)

Model | Score
Claude Opus 4.8 | 88.6%
Claude Opus 4.7 | 87.6%
Claude Sonnet 4.5 | 82%
Claude Opus 4.5 | 80.9%
Claude Opus 4.6 | 80.8%

Best Overall (Humanity's Last Exam)

Model | Score
Claude Opus 4.8 | 57.9%
Gemini 3 Pro | 45.8%
Kimi K2 Thinking | 44.9%
GPT-5.5 Pro | 43.1%
GPT-5.5 | 41.4%

Best in Visual Reasoning (ARC-AGI 2)

Model | Score
GPT-5.5 | 85%
Claude Opus 4.6 | 68.8%
Claude Sonnet 4.6 | 58.3%
GPT 5.2 | 52.9%
Claude Opus 4.5 | 37.6%

Best in Multilingual Reasoning (MMMLU)

Model | Score
Gemini 3 Pro | 91.8%
Claude Opus 4.6 | 91.1%
Claude Opus 4.5 | 90.8%
Claude Opus 4.1 | 89.5%
Claude Sonnet 4.6 | 89.3%

Fastest and most affordable models

Fastest Models (Tokens/sec)

Rank | Model | Speed
1 | Llama 4 Scout | 2600 t/s
2 | Llama 3.3 70b | 2500 t/s
3 | Llama 3.1 70b | 2100 t/s
4 | Llama 3.1 8b | 1800 t/s
5 | Llama 3.1 405b | 969 t/s

Lowest Latency (TTFT)

Rank | Model | Time to first token
1 | GPT-5.3 Codex | 0.003s
2 | Nova Micro | 0.3s
3 | Llama 3.1 8b | 0.32s
4 | Llama 4 Scout | 0.33s
5 | Gemini 2.0 Flash | 0.34s

Cheapest Models (per 1M tokens)

Rank | Model | Input / output cost
1 | Nova Micro | $0.04 / $0.14
2 | Gemma 3 27b | $0.07 / $0.07
3 | Gemini 1.5 Flash | $0.075 / $0.3
4 | GPT oss 20b | $0.08 / $0.35
5 | Gemini 2.0 Flash | $0.1 / $0.4
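
To make the per-1M-token prices concrete, here is a minimal sketch that converts them into the cost of a single request. It assumes the two figures in each row are the input price and the output price in USD, in that order, and that billing is linear in token count. The names `request_cost`, `cost_table`, and `PRICES` are illustrative and not part of any provider's API.

```python
# Prices copied from the list above: (input, output) in USD per 1M tokens.
# Assumption: the first figure is the input price, the second the output price.
PRICES = {
    "Nova Micro": (0.04, 0.14),
    "Gemma 3 27b": (0.07, 0.07),
    "Gemini 1.5 Flash": (0.075, 0.30),
    "GPT oss 20b": (0.08, 0.35),
    "Gemini 2.0 Flash": (0.10, 0.40),
}


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one request, assuming linear per-token billing."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def cost_table(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Return the cost of the same request on every model in PRICES."""
    return {
        model: request_cost(input_tokens, output_tokens, in_price, out_price)
        for model, (in_price, out_price) in PRICES.items()
    }


if __name__ == "__main__":
    # Hypothetical workload: a 10,000-token prompt with a 2,000-token reply.
    for model, cost in sorted(cost_table(10_000, 2_000).items(), key=lambda kv: kv[1]):
        print(f"{model}: ${cost:.6f}")
```

Because input and output are priced separately, the ranking can change with the shape of the workload: a model with a low input price but a high output price may be cheapest for long prompts with short replies and lose that position for long generations.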

Compare models

Spec | Claude Opus 4.8 | GPT-5.5
Context size | 1,000,000 | 1,000,000
Cutoff date | Jan 2026 | Apr 2026
I/O cost | $5 / $25 | $5 / $30
Max output | 128,000 | 128,000
Latency | - | -
Speed | - | -

Benchmark | Claude Opus 4.8 | GPT-5.5
GPQA Diamond | 93.6 | 93.6
BFCL | - | -
MATH 500 | - | -
AIME 2025 | - | -
SWE Bench | 88.6 | 58.6
LiveCodeBench | - | -

Compare Personal AI harnesses

Feature | Vellum | Hermes | OpenClaw | Claude Cowork
Open source | MIT | MIT | Apache 2.0 | Proprietary
Time to set up | Easy | Moderate | Difficult | Easy
Native channels | iOS, MacOS, Web, Voice, Email, Telegram, Slack, CLI | CLI / TUI | CLI, MacOS, Web | CLI, MacOS, Windows, Web
Memory | Managed memory | SQLite + markdown (you build the memory stack) | Basic memory, context loss | Limited
Security | Built-in security | DIY | DIY | No sandboxing
Hosting | Cloud or self-hosted | Self-hosted only | Self-hosted only | Local, limited remote
Native integrations | Managed OAuth connections | No managed connectors | No managed connectors | MCP only
Schedules | Cron + Heartbeat | Cron + Heartbeat | Cron + Heartbeat | Cron only
Pricing | Free + API costs, paid plans available | Free + DIY hosting costs + API costs | Free + DIY hosting costs + API costs | Paid plans available + API costs

Model Comparison

Model | Context size | Cutoff date | I/O cost | Max output | Latency | Speed
Claude Opus 4.6 | 200,000 | May 2025 | $5 / $25 | 128,000 | 1.6s | 67 t/s
Claude Sonnet 4.6 | 200,000 | Aug 2025 | $3 / $15 | 64,000 | 0.73s | 55 t/s
OpenAI o3-mini | 200,000 | Dec 2024 | $1.1 / $4.4 | 8,000 | 14s | 214 t/s
DeepSeek-R1 | 128,000 | Dec 2024 | $0.55 / $2.19 | 8,000 | 4s | 24 t/s
Claude 3.7 Sonnet [R] | 200,000 | Nov 2024 | $3 / $15 | 64,000 | 0.95s | 78 t/s
Gemini 2.5 Pro | 1,000,000 | Nov 2024 | $1.25 / $10 | 65,000 | 30s | 191 t/s
GPT-5 | 400,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Kimi K2 Thinking | 256,000 | Apr 2025 | $0.6 / $2.5 | 16,400 | 25.3s | 79 t/s
Gemini 3 Pro | 10,000,000 | Apr 2025 | $2 / $12 | 650,000 | 30.3s | 128 t/s
Claude 4 Sonnet | 200,000 | Mar 2025 | $3 / $15 | 64,000 | 1.9s | -
Claude 4 Opus | 200,000 | Mar 2025 | $15 / $75 | 32,000 | 1.95s | -
GPT oss 120b | 131,072 | Apr 2025 | $0.15 / $0.6 | 131,072 | 8.1s | 260 t/s
GPT oss 20b | 131,072 | Apr 2025 | $0.08 / $0.35 | 131,072 | 4s | 564 t/s
Claude Opus 4.1 | 200,000 | Apr 2025 | $15 / $75 | 32,000 | - | -
GPT 5.1 | 200,000 | Apr 2025 | $1.25 / $10 | 128,000 | - | -
Claude Sonnet 4.5 | 200,000 | Apr 2025 | $3 / $15 | 160,000 | 31s | 69 t/s
GPT 5.2 | 400,000 | Aug 2025 | $1.5 / $14 | 16,000 | 0.6s | 92 t/s
Claude Opus 4.8 | 1,000,000 | Jan 2026 | $5 / $25 | 128,000 | - | -
GPT-5.5 | 1,000,000 | Apr 2026 | $5 / $30 | 128,000 | - | -
Claude Opus 4.7 | 1,000,000 | Apr 2026 | $5 / $25 | 128,000 | - | -
DeepSeek V3 0324 | 128,000 | Dec 2024 | $0.27 / $1.1 | 8,000 | 4s | 33 t/s
Qwen2.5-VL-32B | 131,000 | Dec 2024 | - | 8,000 | - | -
GPT-4.5 | 128,000 | Nov 2024 | $75 / $150 | 16,384 | 1.25s | 48 t/s
Claude 3.7 Sonnet | 200,000 | Nov 2024 | $3 / $15 | 128,000 | 0.91s | 78 t/s
Grok 3 [Beta] | - | Nov 2024 | - | - | - | -
Gemma 3 27b | 128,000 | Nov 2024 | $0.07 / $0.07 | 8,192 | 0.72s | 59 t/s
GPT-4.1 | 1,000,000 | Dec 2024 | $2 / $8 | 16,000 | - | -
GPT-4.1 mini | 1,000,000 | Dec 2024 | $0.4 / $1.6 | 16,000 | - | -
Claude Opus 4.5 | 200,000 | Apr 2025 | $5 / $25 | 64,000 | - | -
OpenAI o1-mini | 128,000 | Dec 2024 | $3 / $12 | 8,000 | 11.43s | 220 t/s
Llama 4 Maverick | 10,000,000 | Nov 2024 | $0.2 / $0.6 | 8,000 | 0.45s | 126 t/s
Llama 4 Scout | 10,000,000 | Nov 2024 | $0.11 / $0.34 | 8,000 | 0.33s | 2600 t/s
Llama 4 Behemoth | - | Nov 2024 | - | - | - | -
GPT-4.1 nano | 1,000,000 | Dec 2024 | $0.1 / $0.4 | 32,000 | - | -
GPT-5.5 Pro | 1,000,000 | Apr 2026 | $30 / $180 | 128,000 | - | -
GPT-5.3 Codex | 400,000 | Aug 2025 | $1.75 / $14 | 128,000 | 0.003s | 50 t/s

Context window, cost and speed comparison

Model | Context window | Input cost / 1M tokens | Output cost / 1M tokens | Speed (tokens/second) | Latency (seconds)
Claude Opus 4.8 | 1,000,000 | $5 | $25 | n/a | n/a
GPT-5.5 | 1,000,000 | $5 | $30 | n/a | n/a
GPT-5.5 Pro | 1,000,000 | $30 | $180 | n/a | n/a
Claude Opus 4.7 | 1,000,000 | $5 | $25 | n/a | n/a
Claude Opus 4.6 | 200,000 | $5 | $25 | 67 | 1.6
Claude Sonnet 4.6 | 200,000 | $3 | $15 | 55 | 0.73
GPT-5.3 Codex | 400,000 | $1.75 | $14 | 50 | 0.003
DeepSeek V3 0324 | 128,000 | $0.27 | $1.1 | 33 | 4
Qwen2.5-VL-32B | 131,000 | n/a | n/a | n/a | n/a
OpenAI o1-mini | 128,000 | $3 | $12 | 220 | 11.43
OpenAI o3-mini | 200,000 | $1.1 | $4.4 | 214 | 14
DeepSeek-R1 | 128,000 | $0.55 | $2.19 | 24 | 4
Claude 3.7 Sonnet [R] | 200,000 | $3 | $15 | 78 | 0.95
GPT-4.5 | 128,000 | $75 | $150 | 48 | 1.25
Claude 3.7 Sonnet | 200,000 | $3 | $15 | 78 | 0.91
Gemini 2.5 Pro | 1,000,000 | $1.25 | $10 | 191 | 30
Grok 3 [Beta] | n/a | n/a | n/a | n/a | n/a
Gemma 3 27b | 128,000 | $0.07 | $0.07 | 59 | 0.72
Llama 4 Maverick | 10,000,000 | $0.2 | $0.6 | 126 | 0.45
Llama 4 Scout | 10,000,000 | $0.11 | $0.34 | 2600 | 0.33
Llama 4 Behemoth | n/a | n/a | n/a | n/a | n/a
GPT-4.1 | 1,000,000 | $2 | $8 | n/a | n/a
GPT-4.1 mini | 1,000,000 | $0.4 | $1.6 | n/a | n/a
GPT-4.1 nano | 1,000,000 | $0.1 | $0.4 | n/a | n/a
Claude 4 Sonnet | 200,000 | $3 | $15 | n/a | 1.9
Claude 4 Opus | 200,000 | $15 | $75 | n/a | 1.95
GPT oss 120b | 131,072 | $0.15 | $0.6 | 260 | 8.1
GPT oss 20b | 131,072 | $0.08 | $0.35 | 564 | 4
Claude Opus 4.1 | 200,000 | $15 | $75 | n/a | n/a
GPT-5 | 400,000 | $1.25 | $10 | n/a | n/a
GPT 5.1 | 200,000 | $1.25 | $10 | n/a | n/a
Kimi K2 Thinking | 256,000 | $0.6 | $2.5 | 79 | 25.3
Gemini 3 Pro | 10,000,000 | $2 | $12 | 128 | 30.3
Claude Sonnet 4.5 | 200,000 | $3 | $15 | 69 | 31
Claude Opus 4.5 | 200,000 | $5 | $25 | n/a | n/a
GPT 5.2 | 400,000 | $1.5 | $14 | 92 | 0.6
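
The speed and latency columns can be combined into a rough end-to-end estimate for a single response. The sketch below assumes that latency means time to first token, that speed is the steady output rate after the first token, and that network and queuing overhead are ignored; the leaderboard does not state this model, so treat it as a back-of-the-envelope approximation. The function name `estimated_response_seconds` is illustrative.

```python
def estimated_response_seconds(latency_seconds: float,
                               tokens_per_second: float,
                               output_tokens: int) -> float:
    """Estimate wall-clock time for one response.

    Assumed model: total = time to first token + output_tokens / speed.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return latency_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    # Figures taken from the table above; the 500-token reply is hypothetical.
    examples = {
        "Claude Opus 4.6": (1.6, 67),
        "Llama 4 Scout": (0.33, 2600),
        "Gemini 2.5 Pro": (30, 191),
    }
    for model, (latency, speed) in examples.items():
        total = estimated_response_seconds(latency, speed, 500)
        print(f"{model}: about {total:.2f} s for 500 output tokens")
```

Under this model, latency dominates short replies and speed dominates long ones, so a model with a high time to first token can still finish a long generation sooner than one with low latency and low throughput.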

Benchmark glossary

GPQA Diamond
Graduate-level science questions curated by domain experts. Tests advanced reasoning across physics, chemistry, and biology.
AIME 2025
Problems from the 2025 American Invitational Mathematics Examination. Measures multi-step mathematical problem solving.
SWE-Bench Verified
Real GitHub issues from popular Python repos that the model must resolve end-to-end. Measures agentic software engineering ability.
Humanity's Last Exam
A crowd-sourced exam of extremely hard questions spanning every academic discipline. Designed to be the final exam before superhuman AI.
ARC-AGI 2
Abstract visual puzzles requiring novel pattern recognition. Tests fluid intelligence and generalization beyond training data.
MMMLU
Massive Multitask Language Understanding across multiple languages. Evaluates knowledge and reasoning in non-English contexts.
