LLM Leaderboard

Largest context window: Gemini 1.5 Flash (1M), Claude 3 models (200K), GPT-4o and GPT-4 Turbo (128K)
Lowest input cost per 1M tokens: Gemini Pro ($0.125), Claude 3 Haiku ($0.25), Gemini 1.5 Flash ($0.35)

Model Comparison

| Model | Average | Multi-choice Qs (MMLU) | Reasoning (HellaSwag) | Python coding (HumanEval) | Future Capabilities (BBHard) | Grade school math (GSM-8K) | Math Problems (MATH) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | 79.45% | 86.40% | 95.30% | 67% | 83.10% | 92% | 52.90% |
| GPT-3.5 | 65.46% | 70% | 85.50% | 48.10% | 66.60% | 57.10% | 34.1% |
| Gemini Pro | 68.28% | 71.80% | 84.70% | 67.70% | 75% | 77.90% | 32.60% |
| Gemini Ultra | 79.52% | 83.70% | 87.80% | 74.40% | 83.60% | 94.40% | 53.20% |
| Gemini 1.5 Pro | 80.08% | 81.90% | 92.50% | 71.90% | 84% | 91.70% | 58.50% |
| Mixtral 8x7B | 59.79% | 70.60% | 84.40% | 40.20% | 60.76% | 74.40% | 28.40% |
| Llama 3 Instruct - 70B | 79.23% | 82% | 87% | 81.7% | 81.3% | 93% | 50.4% |
| Llama 3 Instruct - 8B | - | 68.40% | - | 62% | 61% | 79.60% | 30% |
| Grok 1.5 | - | 73.00% | - | 63% | - | 62.90% | 23.90% |
| Mistral Large | - | 81.2% | 89.2% | 45.1% | - | 81% | 45% |
| Claude 3 Opus | 84.83% | 86.80% | 95.40% | 84.90% | 86.80% | 95.00% | 60.10% |
| Claude 3 Sonnet | 76.55% | 79.00% | 89.00% | 73.00% | 82.90% | 92.30% | 43.10% |
| Claude 3 Haiku | 73.08% | 75.20% | 85.90% | 75.90% | 73.70% | 88.90% | 38.90% |
| GPT-4o | - | 88.7% | - | 90.2% | - | - | 76.60% |
| Gemini 1.5 Flash | - | 78.90% | - | - | 89.20% | - | 67.70% |
| GPT-4T 2024-04-09 | - | 86.5% | - | - | 87.60% | - | 72.2% |
| Nemotron 340B | - | 78.70% | - | 73.20% | - | 92.30% | - |
| Claude 3.5 Sonnet | 88.38% | 88.70% | 89.00% | 92.00% | 93.10% | 96.40% | 71.10% |
* We don't show an average for models with missing benchmark data (marked "-"), such as Grok 1.5 and Mistral Large.
* This comparison covers only MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, and MATH, because the model reports lack data for other benchmarks.
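
The Average column appears to be the unweighted mean of the six benchmark scores, left blank when any score is missing. The sketch below works under that assumption; it reproduces the published averages for rows such as GPT-4 and Claude 3.5 Sonnet. The names `SCORES` and `average_score` are illustrative and not part of the leaderboard.

```python
from statistics import mean

# Scores in the footnote's benchmark order:
# MMLU, HellaSwag, HumanEval, BBHard, GSM-8K, MATH.
# None stands for a score the table shows as "-".
SCORES = {
    "GPT-4": [86.40, 95.30, 67.00, 83.10, 92.00, 52.90],
    "Claude 3.5 Sonnet": [88.70, 89.00, 92.00, 93.10, 96.40, 71.10],
    "GPT-4o": [88.70, None, 90.20, None, None, 76.60],
}


def average_score(scores):
    """Return the mean of all scores, or None if any score is missing."""
    if any(score is None for score in scores):
        return None
    return round(mean(scores), 2)


if __name__ == "__main__":
    for model, scores in SCORES.items():
        avg = average_score(scores)
        print(model, "-" if avg is None else f"{avg:.2f}%")
```

Returning `None` instead of averaging the available scores keeps models comparable: a model that reports only its strongest benchmarks would otherwise get an inflated average.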


Cost and Context Window Comparison

Comparison of context window and cost per 1M tokens.
| Model | Context Window (tokens) | Input Cost / 1M tokens | Output Cost / 1M tokens |
| --- | --- | --- | --- |
| GPT-4 | 8,000 | $30.00 | $60.00 |
| GPT-4-32k | 32,000 | $60.00 | $120.00 |
| GPT-4 Turbo | 128,000 | $10.00 | $30.00 |
| GPT-3.5 Turbo | 16,000 | $0.50 | $1.50 |
| GPT-3.5 Turbo Instruct | 4,000 | $1.50 | $2.00 |
| Gemini Pro | 32,000 | $0.125 | $0.375 |
| Gemini 1.5 Pro | 128,000 | $7.00 | $21.00 |
| Mistral Small | 16,000 | $2.00 | $6.00 |
| Mistral Medium | 32,000 | $2.70 | $8.10 |
| Mistral Large | 32,000 | $8.00 | $24.00 |
| Claude 3 Opus | 200,000 | $15.00 | $75.00 |
| Claude 3 Sonnet | 200,000 | $3.00 | $15.00 |
| Claude 3 Haiku | 200,000 | $0.25 | $1.25 |
| GPT-4o | 128,000 | $5.00 | $15.00 |
| Gemini 1.5 Flash | 1,000,000 | $0.35 | $0.70 |
| Nemotron | 4,000 | - | - |
| Llama 3 Models | 8,000 | - | - |
| Claude 3.5 Sonnet | 200,000 | $3.00 | $15.00 |
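
Because prices are quoted per 1M tokens, the cost of a single request follows from its input and output token counts. The sketch below shows that arithmetic for three rows copied from the table above. The names `PRICES` and `request_cost` are hypothetical, and actual billing may differ (for example, through tiered or cached-token pricing).

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Gemini 1.5 Flash": (0.35, 0.70),
}


def request_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # A hypothetical request: 100,000 input tokens and 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 100_000, 2_000):.4f}")
```

For that hypothetical request, GPT-4 Turbo comes to about $1.06 and Claude 3 Haiku to about $0.0275, which shows how strongly the input price dominates for long prompts with short outputs.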

HumanEval: Coding Leaderboard

Comparison of pre-trained proprietary and open-source models for code generation.
| Model | HumanEval (0 shot) |
| --- | --- |
| GPT-4 | 67% |
| GPT-4 Turbo | 87.1% |
| GPT-4o | 90.2% |
| GPT-3.5 | 48.10% |
| Gemini Pro | 67.70% |
| Gemini Ultra | 74.40% |
| Gemini 1.5 Pro | 71.90% |
| Mixtral 8x7B | 40.20% |
| Mistral Large | 45.1% |
| Claude 3 Opus | 84.90% |
| Claude 3 Haiku | 75.90% |
| Claude 3 Sonnet | 73.00% |
| Llama 3 70B | 81.7% |

Sources

This leaderboard shows a comparison of capabilities, price and context window for leading commercial and open-source LLMs, based on the benchmark data provided in the models' technical reports. Updated March 2024.

Technical reports

Pricing info