February 26, 2024

LLM Benchmarks: Overview, Limits and Model Comparison

Guest Post
Anita Kirkovska

Why do we need LLM benchmarks?

They provide a standardized method to evaluate LLMs across tasks like coding, reasoning, math, truthfulness, and more.

By comparing different models, benchmarks highlight their strengths and weaknesses.

Below we share more information on the current LLM benchmarks, their limits, and how various models stack up.

If you want to compare more models, check our LLM Leaderboard here or book a demo to start using Vellum Evaluations to run these tests at scale.
Read the whole analysis in the sections that follow, and sign up for our newsletter if you want to get these analyses in your inbox!


Benchmarking LLMs for Reasoning

HellaSwag - Measuring Commonsense Inference

paper | dataset [released 2019]

This test measures the commonsense reasoning of LLMs. It checks whether a model can complete a sentence by choosing, from 4 options, the ending that makes the most sense.

For example:

[Image: HellaSwag dataset example]

Questions that seem simple to humans posed a challenge for state-of-the-art (SOTA) models when the benchmark was released in 2019: they struggled with commonsense inference and reached only about 45% accuracy. By 2024, GPT-4 holds the highest score on this benchmark at 95.3% accuracy, while Mixtral 8x7B leads among open-source models at 84.4% (check more models).
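To make the scoring concrete, here is a minimal sketch of how a 4-way multiple-choice benchmark of this kind can be evaluated. The two items and the `toy_score` word-overlap scorer are invented for illustration (they are not taken from HellaSwag); a real harness would typically score each ending with the model itself, for example by log-likelihood.

```python
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[str, str], float]

def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the ending with the highest score."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)

def accuracy(examples: List[Dict], score: Scorer) -> float:
    """Fraction of examples where the top-scoring ending is the labeled one."""
    correct = sum(
        pick_ending(ex["ctx"], ex["endings"], score) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

def toy_score(context: str, ending: str) -> float:
    """Hypothetical stand-in for a model: count ending words seen in the context."""
    context_words = set(context.lower().split())
    return sum(word in context_words for word in ending.lower().split())

# Invented items in a HellaSwag-like shape (assumption, not real dataset rows).
EXAMPLES = [
    {
        "ctx": "A man pours water into a kettle and",
        "endings": [
            "turns the kettle on to boil the water",
            "paints the fence blue",
            "rides a horse away",
            "sings to a crowd",
        ],
        "label": 0,
    },
    {
        "ctx": "She opens the oven door and",
        "endings": [
            "the door opens the oven",
            "takes out the tray of cookies",
            "flies to the moon",
            "reads a newspaper underwater",
        ],
        "label": 1,
    },
]

print(f"accuracy: {accuracy(EXAMPLES, toy_score):.2f}")
```

The second item is built so that surface word overlap picks the wrong ending, which is the kind of shortcut the benchmark is meant to penalize.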

ARC - Reasoning benchmark

paper | dataset [released 2019]

ARC here refers to the AI2 Reasoning Challenge, a dataset of 7,787 non-diagram, 4-way multiple-choice science questions drawn from 3rd through 9th grade-level standardized tests. It is distinct from the similarly named Abstraction and Reasoning Corpus, which was proposed to measure a human-like form of general fluid intelligence and to enable fair general intelligence comparisons between AI systems and humans.

DROP - A Reading Comprehension + Discrete Reasoning Benchmark

paper | dataset [released 2019]

DROP evaluates models on their ability to pull important details from English-language paragraphs and then perform distinct reasoning actions, such as adding, sorting or counting items, to find the right answer. Here’s an example:

[Image: DROP benchmark dataset example]

In December 2023, Hugging Face found an issue in the normalization step used to score the DROP benchmark. The discrepancy affected numbers followed by certain types of whitespace and the use of punctuation as stop tokens, both of which led to incorrect scoring. Models that generated longer answers, or that had to produce floating point answers, were also scored lower than expected. Changing the end-of-generation token showed potential for bringing scores closer in line with overall performance, but a full fix would have required rerunning the benchmark, which was deemed too resource-intensive.
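The effect of the stop token is easy to reproduce in miniature. The sketch below is a toy, not Hugging Face's scoring code: `truncate_at_stop`, `normalize`, and `exact_match` are hypothetical helpers, and the generation string is invented. It shows how cutting a generation at the first period truncates a floating point answer, while cutting at a newline keeps it intact.

```python
import string

ARTICLES = {"a", "an", "the"}

def truncate_at_stop(generation: str, stop: str) -> str:
    """Keep only the text before the first occurrence of the stop sequence."""
    return generation.split(stop, 1)[0]

def normalize(answer: str) -> str:
    """Toy normalizer: lowercase, canonicalize numbers, strip punctuation and articles."""
    tokens = []
    for token in answer.lower().split():  # split() treats spaces, tabs and newlines alike
        try:
            token = str(float(token))  # "12" -> "12.0", "12.25" -> "12.25"
        except ValueError:
            token = "".join(ch for ch in token if ch not in string.punctuation)
        if token and token not in ARTICLES:
            tokens.append(token)
    return " ".join(tokens)

def exact_match(prediction: str, gold: str) -> bool:
    """Compare prediction and gold answer after normalization."""
    return normalize(prediction) == normalize(gold)

# Invented model output: the answer, then the start of an unwanted continuation.
generation = "12.25\n\nPassage: The next game was played in Denver."
gold = "12.25"

cut_at_period = truncate_at_stop(generation, ".")    # the decimal point ends the answer early
cut_at_newline = truncate_at_stop(generation, "\n")  # the full number survives

print(repr(cut_at_period), exact_match(cut_at_period, gold))
print(repr(cut_at_newline), exact_match(cut_at_newline, gold))
```

In this toy setup the model produced the right number in both cases; only the choice of stop sequence decides whether it is scored as correct.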

QA and Truthfulness Benchmarks

MMLU - Measuring Massive Multitask Language Understanding

paper | dataset [released 2021]

This test measures a model's multitask accuracy. It covers 57 tasks, including elementary mathematics, US history, computer science, and law, at depths ranging from elementary to advanced professional level. To score well on this test, models must have extensive world knowledge and problem-solving ability. Check how the top models (proprietary/open-source) stack up on this benchmark.


TruthfulQA - Measuring How Models Mimic Human Falsehoods

paper | dataset [released 2022]

This benchmark measures whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. For this benchmark, GPT-4 seems to perform the best.

Math Benchmarks

MATH - Arithmetic Reasoning

paper | dataset [released 2021]

MATH is a benchmark built on a dataset of 12,500 challenging competition mathematics problems. Each problem has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations. The authors found that, if scaling trends continue, simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning. Check how current models stack up on this benchmark.

GSM8K - Arithmetic Reasoning

paper | dataset [released 2021]

This dataset consists of 8.5K high-quality, linguistically diverse grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve a sequence of elementary calculations using basic arithmetic operations (+, -, *, /) to reach the final answer. A bright middle school student should be able to solve every problem, but some models still find these tasks challenging.
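Scoring on this kind of dataset usually comes down to comparing final numbers. GSM8K reference solutions end with a line of the form "#### <answer>", so a simple checker can read the reference from that marker and compare it with the last number in the model's output. The sketch below rests on that "last number" heuristic, which is an assumption and only one of several extraction strategies; the problem text and outputs are invented in the dataset's format.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def reference_answer(solution: str) -> str:
    """Return the text after the final '####' marker, without separators."""
    return solution.split("####")[-1].strip().replace(",", "")

def predicted_answer(output: str) -> Optional[str]:
    """Heuristic (assumption): treat the last number in the output as the answer."""
    matches = NUMBER.findall(output)
    return matches[-1].replace(",", "") if matches else None

def is_correct(output: str, solution: str) -> bool:
    """Numeric exact match between the predicted and reference answers."""
    predicted = predicted_answer(output)
    if predicted is None:
        return False
    return float(predicted) == float(reference_answer(solution))

# Invented example written in the dataset's solution format.
solution = "A box holds 12 eggs. 3 boxes hold 12*3 = <<12*3=36>>36 eggs.\n#### 36"
good_output = "Each box has 12 eggs, so 3 boxes have 36 eggs. The answer is 36."
bad_output = "3 boxes have 12 + 3 = 15 eggs."

print(is_correct(good_output, solution))
print(is_correct(bad_output, solution))
```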

Chatbot Assistance Benchmarks

There are two widely used benchmarks for evaluating human preference when it comes to chatbot assistance.

Chatbot Arena

paper | dataset

Developed by the LMSYS organization, Chatbot Arena is a crowdsourced open platform for LLM evals. So far they’ve collected over 200K human preference votes to rank LLMs with the Elo rating system.

How it works: You ask a question to two anonymous AI models (like ChatGPT, Claude, or Llama) without knowing which is which. After receiving both answers, you vote for the one you think is better. You can keep asking questions and voting until you decide on a winner. Your vote only counts if you don't find out which model provided which answer during the conversation.
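The voting process above feeds a rating update. As a rough sketch of how pairwise votes can turn into a ranking, here is the textbook Elo update applied to a list of (winner, loser) votes. The starting rating of 1000, the K-factor of 32, and the model names are assumptions chosen for illustration; this is not LMSYS's actual implementation, and ties are ignored.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def apply_vote(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Move rating points from the loser to the winner after one vote."""
    win_probability = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - win_probability)  # an upset win moves more points
    ratings[winner] += delta
    ratings[loser] -= delta

def rank(votes: Iterable[Tuple[str, str]], initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Replay votes in order and return the final ratings."""
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for winner, loser in votes:
        apply_vote(ratings, winner, loser, k)
    return dict(ratings)

# Hypothetical votes as (winner, loser) pairs.
VOTES = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = rank(VOTES)
for name, rating in sorted(ratings.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rating:.1f}")
```

Because each update depends on the current ratings, the result of this online scheme depends on the order in which votes arrive.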

MT Bench

dataset | paper

MT-bench is a set of challenging multi-turn, open-ended questions for evaluating chat assistants with LLM-as-a-judge. To automate the evaluation, the authors prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
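The LLM-as-a-judge loop can be sketched as follows. Everything here is a simplified assumption: `JUDGE_TEMPLATE` is a made-up prompt, the "[[rating]]" reply format is assumed for easy parsing, and `stub_judge` returns a canned verdict in place of a real call to a judge model. Each turn is also judged on its own, whereas a real multi-turn setup would show the judge the earlier turns as well.

```python
import re
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Hypothetical judge prompt; the [[rating]] reply format is an assumption.
JUDGE_TEMPLATE = (
    "Rate the assistant's answer to the user's question on a scale of 1 to 10.\n"
    "Reply with the rating in the form [[rating]].\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def parse_rating(verdict: str) -> Optional[int]:
    """Extract the integer inside [[...]] from the judge's reply, if present."""
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else None

def judge_conversation(
    turns: List[Tuple[str, str]], judge: Callable[[str], str]
) -> Optional[float]:
    """Ask the judge to rate each (question, answer) turn and average the ratings."""
    ratings = []
    for question, answer in turns:
        verdict = judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
        rating = parse_rating(verdict)
        if rating is not None:  # skip replies that do not follow the format
            ratings.append(rating)
    return mean(ratings) if ratings else None

def stub_judge(prompt: str) -> str:
    """Stand-in for a call to a strong judge model."""
    return "The answer is relevant but brief. Rating: [[7]]"

turns = [
    ("Plan a three-day trip to Lisbon.", "Day 1: Alfama. Day 2: Belem. Day 3: Sintra."),
    ("Now shorten it to one day.", "Morning in Alfama, afternoon in Belem."),
]
print(judge_conversation(turns, stub_judge))
```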

Coding Benchmarks

HumanEval - Coding Benchmark

paper | dataset [released 2021]

This is the most widely used benchmark for evaluating the performance of LLMs on code generation tasks.

The HumanEval dataset has 164 handwritten programming problems that assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. Learn how LLMs compare on this task.
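A completion counts as correct only if it passes the problem's unit tests. The sketch below shows that functional-correctness check in its simplest form, assuming a problem shaped like a HumanEval record: a prompt holding the signature and docstring, a test string that defines `check(candidate)`, and an entry point name. The `add` problem and both completions are invented, and `passes_tests` is a hypothetical helper with no sandboxing or timeout; model-generated code should only be executed in an isolated environment.

```python
def passes_tests(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run the problem's unit tests against prompt + completion.

    Warning: exec() runs arbitrary code. This is for illustration only.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)       # define the candidate function
        exec(test, namespace)                      # define check(candidate)
        namespace["check"](namespace[entry_point])  # raises if any assertion fails
    except Exception:
        return False
    return True

# Invented problem in a HumanEval-like shape (assumption, not a real dataset row).
PROMPT = 'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n'
TEST = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)
GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"

print(passes_tests(PROMPT, GOOD_COMPLETION, TEST, "add"))
print(passes_tests(PROMPT, BAD_COMPLETION, TEST, "add"))
```

The benchmark score is then the share of problems for which a sampled completion passes all of its tests.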

MBPP - Coding Benchmark for Python problems

paper | dataset [released 2021]

The benchmark consists of around 1,000 crowd-sourced Python programming problems designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases (see how LLMs compare).

Limitations of LLM Benchmarks

There are two major limitations of current LLM benchmarks:

1. Restricted scope

Many benchmarks have restricted scope, usually targeting capabilities on which LLMs have already proven some proficiency. Because they focus on areas where language models are known to be good, they're not great at finding new or unexpected skills that may emerge as language models get more advanced.

2. Short life span

Benchmarks for language models also tend not to stay useful for long. Once models reach human-level performance on a benchmark, it is typically either retired and replaced or updated with harder challenges.

This short lifespan is likely because these benchmarks don't cover tasks that are much harder than what current language models can do.

It’s clear that as models continue to improve, their scores on current benchmarks will rise and converge. So we'll need to test models on capabilities that are not yet possible, with benchmarks like BBHard.

Testing Future Potential

BigBench - Predicting future potential

paper | dataset [released 2023]

BIG-bench was created to test the present and near-future capabilities and limitations of language models, and to understand how those capabilities and limitations are likely to change as models improve.

This evaluation currently consists of 204 tasks that are believed to be beyond the capabilities of current language models. The tasks were contributed by 450 authors across 132 institutions, and the topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond.

Model Performance Across Key LLM Benchmarks

These are the benchmarks most commonly reported in models’ technical reports:

  1. MMLU - Multitask accuracy
  2. HellaSwag - Reasoning
  3. HumanEval - Python coding tasks
  4. BBHard - Probing models for future capabilities
  5. GSM-8K - Grade school math
  6. MATH - Competition math problems with 5 difficulty levels across 7 subjects

Here's how the top 10 LLMs rank on these benchmarks:

Click here to open the interactive view of this table.

In summary, at the time of writing this blog post (March 2024), the leaderboard shows:

  • Claude 3 Opus has the best average score across all benchmarks, with Gemini 1.5 Pro right behind (although this model isn't yet released);
  • GPT-4 is starting to fall behind the Gemini and Claude models (I guess we're waiting for GPT-5);
  • Claude 3 Sonnet and Haiku show better results than GPT-3.5 (they're in the same category of cheaper and faster models);
  • Mixtral 8x7B has the best average score across all benchmarks among open-source models.

Want to evaluate your AI apps?

This data is useful when choosing a model for a general task, but if you’d like to evaluate with your own benchmark data, then we can help!

Vellum has the tooling to support the entire prompt lifecycle, from prototyping to production. The Evaluation feature can help you test prompts in the prototyping phase, evaluate the deployed prompt, and check for regressions.

If you’re interested in setting up your own Evaluation suite, let us know at support@vellum.ai or book a call on this link.


Anita Kirkovska

Founding Growth at Vellum

Anita Kirkovska is currently leading Growth and Content Marketing at Vellum. She is a technical marketer with an engineering background and a sharp acumen for scaling startups. She has helped SaaS startups scale and had a successful exit from an ML company. Anita writes a lot of content on generative AI to educate business founders on best practices in the field.
