
LLM Leaderboard

Largest context window: Claude 3 (200K), GPT-4 Turbo (128K), Gemini 1.5 Pro (128K)
Lowest input cost per 1M tokens: Gemini Pro ($0.125), Mistral Tiny ($0.15), GPT-3.5 Turbo ($0.50)

Model Comparison

| Model | Average | Multi-choice Qs (MMLU) | Reasoning (HellaSwag) | Python coding (HumanEval) | Future Capabilities (BBH) | Grade school math (GSM-8K) | Math Problems (MATH) |
|---|---|---|---|---|---|---|---|
| GPT-4 | 79.45% | 86.40% | 95.30% | 67% | 83.10% | 92% | 52.90% |
| GPT-3.5 | 65.46% | 70% | 85.50% | 48.10% | 66.60% | 57.10% | 34.1% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75% | 77.90% | 32.60% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct 70B | 79.23% | 82% | 87% | 81.7% | 81.3% | 93% | 50.4% |
| Llama 3 Instruct 8B | - | 68.40% | - | 62% | 61% | 79.60% | 30% |
| Falcon 180B | 42.62% | 70.60% | 87.50% | 35.40% | 37.10% | 19.60% | 5.50% |
| Grok 1.5 | - | 73.00% | - | 63% | - | 62.90% | 23.90% |
| Qwen 14B | - | 66.30% | - | 32% | 53.40% | 61.30% | 24.80% |
| Gemma 7B | 50.60% | 64.30% | 81.2% | 32.3% | 55.10% | 46.40% | 24.30% |
| Llama 2 Chat 13B | 37.63% | 54.80% | 80.7% | 18.3% | 39.40% | 28.70% | 3.9% |
| Llama 2 Chat 7B | 30.84% | 45.30% | 77.22% | 12.8% | 32.6% | 14.6% | 2.5% |
| Mistral Large | - | 81.2% | 89.2% | 45.1% | - | 81% | 45% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
* We don't show an average value for Grok 1.5, Mistral Large, Qwen 14B, and Llama 3 Instruct 8B because of missing benchmark data.
* This comparison focuses solely on MMLU, HellaSwag, HumanEval, BBH (Big-Bench Hard), GSM-8K, and MATH; other benchmarks are excluded because the data is absent from some model reports.
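
The Average column above is consistent with a simple mean of the six benchmark scores, shown only when every score is available. A minimal sketch of that calculation (the function name and data layout below are illustrative, not taken from the source):

```python
def average_score(scores: list[float | None]) -> float | None:
    """Return the mean of the benchmark scores, or None if any score is missing."""
    if any(s is None for s in scores):
        return None  # models with missing data get no average (see footnote above)
    return round(sum(scores) / len(scores), 2)

# GPT-4 row: MMLU, HellaSwag, HumanEval, BBH, GSM-8K, MATH
print(average_score([86.40, 95.30, 67.0, 83.10, 92.0, 52.90]))  # 79.45
# Grok 1.5 is missing HellaSwag and BBH scores, so no average is shown
print(average_score([73.00, None, 63.0, None, 62.90, 23.90]))   # None
```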

Want to compare models for your use case?

Use Vellum AI to perform side-by-side comparisons for your prompt variations.


HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation, scored on 0-shot HumanEval (a sketch of how such a score is computed follows the table).
| Model | HumanEval (0-shot) |
|---|---|
| GPT-4 | 67% |
| GPT-3.5 | 48.10% |
| Claude 2 | 70% |
| PaLM 2-L | 37.60% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Llama 2 70B | 30.50% |
| Falcon 40B | 35.40% |
| Qwen 14B | 32.30% |
| Code Llama 34B | 48.80% |
| Unnatural Code Llama 34B | 62.20% |
| Gemma 7B | 32.3% |
| Mistral Large | 45.1% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
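
HumanEval scores like those above are typically reported as pass@1: the share of problems for which a single generated completion passes the problem's unit tests. A minimal, illustrative sketch of that scoring loop (the `generate_solution` helper and the problem format are hypothetical placeholders, not the official benchmark harness):

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Run a candidate solution against the problem's unit tests.

    Real harnesses sandbox this step; exec() is used here only to keep
    the sketch short.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # asserts raise if the solution is wrong
        return True
    except Exception:
        return False

def pass_at_1(problems: list[dict], generate_solution) -> float:
    """pass@1 = fraction of problems solved by one sampled completion each."""
    solved = sum(
        passes_tests(generate_solution(p["prompt"]), p["tests"]) for p in problems
    )
    return solved / len(problems)

# Toy example with one problem and a stand-in "model":
toy_problems = [{"prompt": "def add(a, b):", "tests": "assert add(2, 3) == 5"}]
print(pass_at_1(toy_problems, lambda prompt: "def add(a, b):\n    return a + b"))  # 1.0
```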

Cost and Context Window Comparison

Comparison of context window size and API cost per 1M tokens (a short cost-estimation sketch follows the table).
| Model | Context Window | Input Cost / 1M tokens | Output Cost / 1M tokens |
|---|---|---|---|
| GPT-4 | 8K | $30.00 | $60.00 |
| GPT-4-32k | 32K | $60.00 | $120.00 |
| GPT-4 Turbo | 128K | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16K | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4K | $1.50 | $2.00 |
| Gemini Pro | 32K | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128K | $7.00 | $21.00 |
| Claude 2 | 100K | $8.00 | $24.00 |
| Claude 2.1 | 200K | $8.00 | $24.00 |
| Claude Instant | 100K | $0.80 | $2.40 |
| Mistral Small | 16K | $2.00 | $6.00 |
| Mistral Medium | 32K | $2.70 | $8.10 |
| Mistral Large | 32K | $8.00 | $24.00 |
| Claude 3 Opus | 200K | $15.00 | $75.00 |
| Claude 3 Sonnet | 200K | $3.00 | $15.00 |
| Claude 3 Haiku | 200K | $0.25 | $1.25 |
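
Per-1M-token pricing translates to request cost as (input tokens / 1,000,000) x input price plus (output tokens / 1,000,000) x output price. A minimal sketch of that arithmetic using two rows from the table above (the function name and example token counts are illustrative):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate the USD cost of a single request from per-1M-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# 10K input tokens and 1K output tokens, priced from the table above:
print(request_cost(10_000, 1_000, 15.00, 75.00))  # Claude 3 Opus  -> 0.225
print(request_cost(10_000, 1_000, 0.25, 1.25))    # Claude 3 Haiku -> 0.00375
```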

Sources

This leaderboard compares the capabilities, pricing, and context windows of leading commercial and open-source LLMs, based on the benchmark data reported in each model's technical report. Updated March 2024.

Technical reports

Pricing info