Why do we need LLM benchmarks?
They provide a standardized method to evaluate LLMs across tasks like coding, reasoning, math, truthfulness, and more.
By comparing different models, benchmarks highlight their strengths and weaknesses.
Below we share more information on the current LLM benchmarks, their limits, and how various models stack up.
HellaSwag - Measuring Commonsense Inference
paper | dataset [released 2019]
This test measures the commonsense reasoning of LLMs. It checks whether a model can complete a sentence by choosing the most plausible of 4 candidate endings.
For example:
Questions that seem simple to humans posed challenges for state-of-the-art (SOTA) models when the benchmark was released in 2019: they struggled with commonsense inference and reached only about 45% accuracy. As of 2024, GPT-4 holds the highest score with 95.3% accuracy, while among open-source models Mixtral 8x7B leads with 84.4% (check more models).
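The scoring behind a 4-way multiple-choice benchmark like this one can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: `score_ending` stands in for whatever model scoring you use (for example, a log-likelihood of the ending given the context), and the toy scorer and sample item below are made up for the example.

```python
from typing import Callable, Iterable, Sequence, Tuple

Scorer = Callable[[str, str], float]


def pick_ending(context: str, endings: Sequence[str], score_ending: Scorer) -> int:
    """Return the index of the ending that the scorer rates highest."""
    scores = [score_ending(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(items: Iterable[Tuple[str, Sequence[str], int]], score_ending: Scorer) -> float:
    """items holds (context, endings, gold_index) triples."""
    items = list(items)
    correct = sum(pick_ending(c, e, score_ending) == gold for c, e, gold in items)
    return correct / len(items)


# Hypothetical stand-in for a model: counts words shared by context and ending.
def overlap_scorer(context: str, ending: str) -> float:
    return len(set(context.lower().split()) & set(ending.lower().split()))


# Made-up item in the spirit of the benchmark (not taken from the dataset).
items = [
    (
        "She picks up the guitar and",
        ["eats the soup.", "plays the guitar softly.", "parks a car.", "paints a wall."],
        1,
    )
]
print(accuracy(items, overlap_scorer))
```

In a real run the scorer would query the model under test. Everything else stays the same, which is why reported scores can shift when the scoring rule changes.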
ARC - Reasoning benchmark
paper | dataset [released 2019]
Two different benchmarks share the ARC name. The Abstraction and Reasoning Corpus is meant to measure a human-like form of general fluid intelligence and to enable fair general intelligence comparisons between AI systems and humans. The AI2 Reasoning Challenge, which is the one usually reported in LLM evaluations, contains 7,787 non-diagram, 4-way multiple-choice science questions drawn from 3rd through 9th grade-level standardized tests.
DROP - A Reading Comprehension + Discrete Reasoning Benchmark
paper | dataset [released 2019]
DROP evaluates models on their ability to pull important details from English-language paragraphs and then perform distinct reasoning actions, such as adding, sorting or counting items, to find the right answer. Here’s an example:
In December 2023, Hugging Face reported an issue with the normalization step of the DROP benchmark. The normalization mishandled numbers followed by certain types of whitespace and used punctuation as stop tokens, which led to incorrect scoring. As a result, models that generated longer answers, or that needed to produce floating point answers, were scored lower than expected. Changing the end-of-generation token showed potential for bringing scores closer to overall model performance, but a full fix would require rerunning the benchmark, which was deemed too resource-intensive.
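To see why the normalization step matters, here is a minimal sketch of answer normalization followed by exact-match and token-level F1 scoring. It is not the official DROP scorer or the Hugging Face harness code. The helper names and the exact rules (canonicalizing numbers, dropping articles and punctuation) are assumptions chosen for illustration.

```python
import string
from collections import Counter
from typing import List, Optional

ARTICLES = {"a", "an", "the"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _as_number(token: str) -> Optional[str]:
    """Return a canonical string for numeric tokens, tolerating trailing punctuation."""
    for candidate in (token, token.rstrip(".,;:!?")):
        try:
            return str(float(candidate))  # "10" and "10.0" become the same token
        except ValueError:
            continue
    return None


def normalize(answer: str) -> List[str]:
    """Lowercase, split on any whitespace, keep numbers intact, drop punctuation and articles."""
    tokens = []
    for token in answer.lower().split():  # split() treats "\n" and "\t" like spaces
        number = _as_number(token)
        if number is not None:
            tokens.append(number)
            continue
        token = token.translate(PUNCT_TABLE)
        if token and token not in ARTICLES:
            tokens.append(token)
    return tokens


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("  10\n", "10.0"))  # number followed by a newline still matches
print(token_f1("10 yards", "10"))     # a longer answer gets partial credit
```

A scorer that stripped the decimal point from "12.5" or cut the answer at the first punctuation mark would penalize correct outputs, which is the kind of discrepancy described above.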
MMLU - Measuring Massive Multitask Language Understanding
paper | dataset [released 2021]
This test measures a model's multitask accuracy. It covers 57 tasks including elementary mathematics, US history, computer science, law, and more at varying depths, from elementary to advanced professional level. To get high accuracy on this test, models must have extensive world knowledge and problem-solving ability. Check how the top models (proprietary/open-source) stack up on this benchmark.
TruthfulQA
paper | dataset [released 2022]
This benchmark measures whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics. For this benchmark, GPT-4 seems to perform the best.
MATH - Arithmetic Reasoning
paper | dataset [released 2021]
MATH is a benchmark with a dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution, which can be used to teach models to generate answer derivations and explanations. The authors of this benchmark found that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. Check how current models stack up on this benchmark.
GSM8K - Arithmetic Reasoning
paper | dataset [released 2021]
This dataset consists of 8.5K high-quality, linguistically diverse grade school math word problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. A bright middle school student should be able to solve every problem, but some models still find these tasks challenging.
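Scoring a benchmark like this usually comes down to comparing final numbers. The sketch below assumes the GSM8K convention that each reference solution ends with a line of the form `#### <answer>`. Taking the last number in the model's output as its answer is a heuristic we picked for illustration, and the sample problem is made up.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def gold_answer(solution: str) -> str:
    """Take the text after the final '####' marker and drop thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation: str) -> Optional[str]:
    """Heuristic (an assumption): treat the last number in the output as the answer."""
    numbers = NUMBER.findall(generation)
    return numbers[-1].replace(",", "") if numbers else None


def is_correct(generation: str, solution: str) -> bool:
    prediction = predicted_answer(generation)
    return prediction is not None and float(prediction) == float(gold_answer(solution))


# Made-up example in the dataset's style.
solution = "Tom has 3 boxes with 6 eggs each.\n3 * 6 = 18\n#### 18"
generation = "Each box holds 6 eggs, so 3 boxes hold 18 eggs. The answer is 18."
print(is_correct(generation, solution))
```

Comparing as floats means "18" and "18.0" count as the same answer. A stricter or looser extraction rule would change the reported accuracy.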
There are two widely used benchmarks for evaluating human preference when it comes to chatbot assistance.
Chatbot Arena
Developed by the LMSYS organization, the Chatbot Arena is a crowdsourced open platform for LLM evals. So far they’ve collected over 200K human preference votes to rank LLMs with the Elo rating system.
How it works: You ask a question to two anonymous AI models (like ChatGPT, Claude, or Llama) without knowing which is which. After receiving both answers, you vote for the one you think is better. You can keep asking questions and voting until you decide on a winner. Your vote only counts if you don't find out which model provided which answer during the conversation.
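The way pairwise votes turn into a ranking can be sketched with the standard Elo update. This is a minimal illustration only: the K-factor of 32, the starting rating of 1000, and the model names are assumptions, not the values or the exact procedure used by the Arena.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate(votes: Iterable[Tuple[str, str, float]], initial: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Apply the votes in order and return the final rating of every model."""
    ratings: Dict[str, float] = defaultdict(lambda: initial)
    for model_a, model_b, score_a in votes:
        ratings[model_a], ratings[model_b] = update_elo(
            ratings[model_a], ratings[model_b], score_a, k
        )
    return dict(ratings)


# Hypothetical votes: (first model, second model, score of the first model).
votes = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5), ("model_x", "model_z", 1.0)]
for name, rating in sorted(rate(votes).items(), key=lambda item: -item[1]):
    print(name, round(rating, 1))
```

Because each update adds to one model exactly what it takes from the other, the ratings only express relative strength. Beating a higher-rated model moves the numbers more than beating a lower-rated one.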
MT Bench
MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants with LLM-as-a-judge. To automate the evaluation process, the authors prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
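A rough sketch of the LLM-as-a-judge loop looks like this. The prompt wording, the 1 to 10 scale, and the `[[rating]]` output convention are assumptions modeled on how such judge prompts are commonly written, not the exact MT-bench prompt. `call_judge` is a hypothetical placeholder for a call to the judge model.

```python
import re
from typing import Callable, Optional

# Illustrative template; not the official judge prompt.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the assistant's answer "
    "on a scale of 1 to 10. End with the rating in the form [[rating]].\n\n"
    "[Question]\n{question}\n\n[Answer]\n{answer}\n"
)

RATING = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def parse_rating(judgment: str) -> Optional[float]:
    """Pull the bracketed score out of the judge's free-text verdict."""
    match = RATING.search(judgment)
    return float(match.group(1)) if match else None


def judge(question: str, answer: str, call_judge: Callable[[str], str]) -> Optional[float]:
    """Build the prompt, ask the judge model, and parse its score."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(call_judge(prompt))


# Stand-in for a real judge model so the sketch runs offline.
def fake_judge(prompt: str) -> str:
    return "The answer is accurate but brief. Rating: [[7]]"


print(judge("What is the capital of France?", "Paris.", fake_judge))
```

Returning `None` when no score can be parsed keeps malformed judgments out of the average instead of silently counting them as zero.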
HumanEval - Coding Benchmark
paper | dataset [released 2021]
This is the most used benchmark to evaluate the performance of LLMs in code generation tasks.
The HumanEval dataset has a set of 164 handwritten programming problems that test language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem. Learn how LLMs compare on this task.
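Checking a generated solution against a problem's unit tests can be sketched as below. The layout (a prompt holding the signature and docstring, a test block that defines `check(candidate)`, and an entry point name) mirrors the public dataset as we understand it, but the toy problem is made up. Note that running model-written code with `exec` is unsafe outside a sandbox.

```python
def passes_tests(program: str, test_code: str, entry_point: str) -> bool:
    """Run the candidate program and its tests in a fresh namespace.

    WARNING: exec() on untrusted model output is dangerous; a real harness
    isolates this step in a sandboxed process with a timeout.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)                    # defines the candidate function
        exec(test_code, namespace)                  # defines check(candidate)
        namespace["check"](namespace[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False


# Made-up problem in the benchmark's style.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
test_code = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "    assert candidate(-1, 1) == 0\n"
)

good_completion = "    return a + b\n"
bad_completion = "    return a - b\n"

print(passes_tests(prompt + good_completion, test_code, "add"))
print(passes_tests(prompt + bad_completion, test_code, "add"))
```

A completion counts as correct only if every assertion passes, so a solution that handles most cases but misses one edge case scores zero on that problem.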
MBPP - Coding Benchmark for Python problems
paper | dataset [released 2021]
The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases (see how LLMs compare).
There are two major limitations of current LLM benchmarks:
1. Restricted scope
Many benchmarks have restricted scope, usually targeting capabilities on which LLMs have already proven some proficiency. Because they focus on areas where language models are known to be good, they're not great at finding new or unexpected skills that may emerge as language models get more advanced.
2. Short life span
Also, benchmarks for language modeling often don't stay useful for long. Once language models reach human-level performance on a benchmark, it is typically either retired and replaced or updated with harder challenges.
This short lifespan is likely because these benchmarks don't cover tasks that are much harder than what current language models can do.
It’s clear that as models continue to improve, they will achieve increasingly similar and higher scores on current benchmarks. So, we'll need to test models on future capabilities that are not possible now - with benchmarks like BBHard.
BigBench - Predicting future potential
paper | dataset [released 2023]
BIG-bench is created to test the present and near-future capabilities and limitations of language models, and to understand how those capabilities and limitations are likely to change as models are improved.
This evaluation currently consists of 204 tasks that are believed to be beyond the capabilities of current language models. These tasks were contributed by 450 authors across 132 institutions, and the topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond.
These are the most commonly used LLM benchmarks in models’ technical reports:
- MMLU - Multitask accuracy
- HellaSwag - Reasoning
- HumanEval - Python coding tasks
- BBHard - Probing models for future capabilities
- GSM-8K - Grade school math
- MATH - Competition math problems across 7 subjects and 5 difficulty levels
Here's how the top 10 LLMs rank on these benchmarks:
In summary, at the time of writing (March 2024), the leaderboard shows:
- Claude 3 Opus has the best average score across all benchmarks, and Gemini 1.5 Pro is right after (although this model isn't yet released);
- GPT-4 is starting to fall behind Gemini and Claude models (I guess we're waiting for GPT-5);
- Claude 3 Sonnet and Haiku are showing better results than GPT-3.5 (they're in the same category of cheaper and faster models);
- Mixtral 8x7B has the best average score across all benchmarks among open-source models.
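The "average score across all benchmarks" used in this summary can be computed as in the sketch below. The model names and numbers are placeholders, not real results. Averaging only over the benchmarks a model reports is one possible convention, chosen here as an assumption.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Scores = Dict[str, Optional[float]]


def average_score(per_benchmark: Scores) -> Optional[float]:
    """Average over reported benchmarks; None marks a missing score."""
    reported = [score for score in per_benchmark.values() if score is not None]
    return mean(reported) if reported else None


def rank_models(table: Dict[str, Scores]) -> List[Tuple[str, float]]:
    """Sort models by average score, highest first; skip models with no scores."""
    averages = {model: average_score(scores) for model, scores in table.items()}
    ranked = [(model, avg) for model, avg in averages.items() if avg is not None]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder data for illustration only.
table = {
    "model_a": {"MMLU": 80.0, "HellaSwag": 90.0, "GSM8K": 70.0},
    "model_b": {"MMLU": 70.0, "HellaSwag": 80.0, "GSM8K": None},
    "model_c": {"MMLU": 60.0, "HellaSwag": 70.0, "GSM8K": 50.0},
}
for model, avg in rank_models(table):
    print(model, avg)
```

Skipping missing scores can flatter a model that did not report a hard benchmark, so it is worth checking which benchmarks each average actually covers before comparing.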
Want to evaluate your AI apps?
This data is useful when choosing a model for a general task, but if you’d like to evaluate with your own benchmark data, then we can help!
Vellum has the tooling to support the entire lifecycle management of prompts from prototyping to production. The Evaluation feature can help you test prompts in the prototyping phase, evaluate the deployed prompt, and check for regressions.
If you’re interested in setting up your own Evaluation suite, let us know at support@vellum.ai or book a call on this link.