February 14, 2024

How to Evaluate the Quality of LLMs for Production Use-Cases

Akash Sharma

LLMs have opened up a whole new class of software applications.

However, evaluating the quality of their outputs at scale is hard.

Despite the great versatility demonstrated in side projects / hackathons, we’ve seen companies struggle to put LLMs in production.

One of the common reasons is that there's no framework for evaluating the quality of these models. LLMs are inherently probabilistic: the same input can produce different outputs depending on the probabilities assigned by the model when the temperature is greater than 0, and seemingly small changes to a prompt can result in vastly different outputs.

Having built applications on LLMs for more than 4 years now, we know how important measuring output quality is for creating a great user experience. There needs to be sufficient unit testing before going into production, and regression testing to make changes once in production.

In this post we’ll share our learnings on the best ways to measure LLM quality both before and after deploying to production.

LLM Use Cases

In order to measure quality, the metrics you choose depend on the type of use case. There are various options:

  • Classification: Input text is classified into 2 or more categories. For example, a binary classifier to determine whether an incoming user message should be escalated to a human for review or if an LLM should answer the question.
  • Structured data extraction: Unstructured data (often from PDFs) is converted into a structured format like JSON, saving hours of manual data entry work. An example here is converting PDF invoices to JSON.
  • SQL/Code generation: Natural-language instructions are given to an LLM to produce code that a machine can execute, and which usually has a correct answer. Text-to-SQL is the best example here.
  • Generative/creative output: This is what LLMs are best known for, blog posts, sales emails, song lyrics — there’s no limit here!

LLM Evaluation Metrics

Below, we walk through the evaluation metrics relevant to each of these use cases.


Classification

This is the easiest use case to quantitatively measure quality for, because there's usually a correct answer!

For classification use cases, we recommend building up a bank of a few hundred test cases (where each case is something like “given these inputs, I expect this output”). The metrics you should be looking at are accuracy, precision, recall, and F1 score.

To dive deeper into the results, you may also want to see confusion matrices to understand where the model is making mistakes.
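For a binary classifier like the escalation example above, these metrics can be computed from a test bank in a few lines of plain Python (in practice, a library such as scikit-learn does this for you). The labels and predictions below are purely illustrative:

```python
# Illustrative test bank for an "escalate vs. answer" classifier:
# (expected_label, model_prediction) pairs.
test_bank = [
    ("escalate", "escalate"),
    ("escalate", "answer"),
    ("answer", "answer"),
    ("answer", "answer"),
    ("answer", "escalate"),
]

def binary_metrics(pairs, positive="escalate"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    tp = sum(1 for e, p in pairs if e == positive and p == positive)
    fp = sum(1 for e, p in pairs if e != positive and p == positive)
    fn = sum(1 for e, p in pairs if e == positive and p != positive)
    tn = sum(1 for e, p in pairs if e != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = binary_metrics(test_bank)
```

The same (expected, predicted) pairs can also be tallied into a confusion matrix to see exactly which categories the model confuses.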

We compared four state-of-the-art models to see which ones are best for text classification, and you can read the technical report here.

Structured Data Extraction

The output here is usually a machine-readable format; JSON is a common choice. There are several kinds of tests you can run to measure quality:

  • Validate that the output is syntactically valid
  • Ensure expected keys are present in the generated response
  • Flag which keys do/don’t have correct values/types
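A minimal sketch of those three checks, assuming a hypothetical invoice schema (the keys and types below are made up for illustration):

```python
import json

# Hypothetical expected schema for the PDF-invoice example.
EXPECTED_SCHEMA = {
    "invoice_number": str,
    "total": float,
    "currency": str,
}

def validate_extraction(raw_output):
    """Report whether the output is valid JSON, which expected keys are
    missing, and which present keys have the wrong type."""
    report = {"valid_json": False, "missing_keys": [], "wrong_types": []}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return report
    report["valid_json"] = True
    for key, expected_type in EXPECTED_SCHEMA.items():
        if key not in data:
            report["missing_keys"].append(key)
        elif not isinstance(data[key], expected_type):
            report["wrong_types"].append(key)
    return report

# A response with a missing key ("currency") and a mistyped value
# ("total" came back as a string instead of a number).
report = validate_extraction('{"invoice_number": "INV-1", "total": "99.50"}')
```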

SQL/Code Generation

For this use case, you usually want the LLM to generate SQL or code from natural-language instructions. Useful checks include:

  • Validate that the output is syntactically valid SQL
  • Validate that it can be executed successfully
  • Confirm that the queries return the expected values for defined test cases
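Using Python's built-in sqlite3 module, those checks can be sketched against an in-memory database; the schema, query, and expected rows below are illustrative:

```python
import sqlite3

def check_generated_sql(sql, setup_statements, expected_rows):
    """Check that generated SQL executes successfully and returns the
    expected rows for a defined test case."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_statements)
    try:
        # Executing against a real schema surfaces both syntax errors
        # and runtime errors (missing tables, bad columns) at once.
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return {"executes": False, "matches": False}
    finally:
        conn.close()
    return {"executes": True, "matches": rows == expected_rows}

# Illustrative test case for a text-to-SQL prompt.
setup = "CREATE TABLE users (id INTEGER, name TEXT); INSERT INTO users VALUES (1, 'Ada'), (2, 'Bob');"
result = check_generated_sql("SELECT name FROM users WHERE id = 1;", setup, [("Ada",)])
```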

Creative Output

The challenge with evaluating creative output is that there is no one correct answer.

When you have an example of a good output, semantic similarity can be used as a proxy for quality before productionizing your use case. “How similar in meaning is this response to the target response?” Cross-encoders are best suited to measure the similarity between the expected and actual output. If the temperature for these prompts is > 0, make sure to run each model/prompt/test case combination multiple times so you can see the variance in semantic similarity.
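In practice you would score similarity with a trained cross-encoder (for example via the sentence-transformers library); to stay self-contained, the sketch below substitutes a simple bag-of-words cosine as a crude stand-in, which is enough to illustrate the workflow of scoring several sampled outputs against a target response:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Bag-of-words cosine: a crude stand-in for a cross-encoder score,
    # used here only to keep the example runnable without model downloads.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

target = "thanks for reaching out, happy to help"
samples = [  # run the same prompt several times when temperature > 0
    "thanks for reaching out, glad to help",
    "sure, happy to help with that",
]
scores = [cosine_similarity(target, s) for s in samples]
```

Looking at the spread of scores across samples, not just the average, tells you how much variance the temperature setting is introducing.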

LLM as a Judge

Another approach is to use one LLM to evaluate the quality of another LLM's output.

But are LLMs as good as human evaluators?

The latest research on this topic suggests that using LLMs like GPT-4 to evaluate outputs is a scalable and explainable way to approximate human preference. You can currently use our Workflow Evaluation Metric feature in Vellum to do just that.
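A sketch of what an LLM-as-judge prompt can look like. Only the prompt construction is shown; the rubric wording, 1-5 scale, and JSON response format are assumptions for illustration, and you would send the resulting prompt to your judge model (e.g. GPT-4) with your usual client:

```python
# Illustrative judge rubric; not a prescribed template.
JUDGE_TEMPLATE = """You are grading the output of another model.

Task instructions:
{instructions}

Model output:
{output}

Score the output from 1 (poor) to 5 (excellent) on how well it follows the
instructions, then explain the score in one sentence.
Respond as JSON: {{"score": <int>, "reason": "<string>"}}"""

def build_judge_prompt(instructions, output):
    return JUDGE_TEMPLATE.format(instructions=instructions, output=output)

prompt = build_judge_prompt("Write a polite refund email.", "No refunds. Bye.")
```

Asking the judge for a reason alongside the score is what makes this approach explainable: you can audit why an output was marked low.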

Measuring Quality in Production

User feedback is the ultimate source of truth for model quality — if there’s a way for your users to either implicitly or explicitly tell you whether the response is “good” or “bad,” that’s what you should track and improve!

High quality input/output pairs captured in this way can also be used to fine tune models (you can read more about our thoughts on fine tuning here).

Implicit vs Explicit User Feedback

Explicit user feedback is collected when your users respond with something like a 👍 or 👎 in your UI when interacting with the LLM output. Asking explicitly may not result in enough volume of feedback to measure overall quality. If your feedback collection rates are low, we suggest using implicit feedback if possible.

Implicit feedback is based on how users react to the output generated by the LLM. For example, if you generate a first draft of an email for a user and they send it without making edits, that’s likely a good response! If they hit regenerate, or re-write the whole thing, that’s probably not a good response. Implicit feedback collection may not be possible for all use cases, but it can be a powerful gauge of quality.
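One way to turn "sent unchanged vs. rewritten" into a trackable signal is to compare the generated draft against what the user actually sent. The sketch below uses Python's difflib; the similarity thresholds are illustrative, not recommendations:

```python
import difflib

def implicit_feedback(draft, sent):
    """Treat the user's edits as an implicit quality signal: a sent email
    that is close to the generated draft suggests a good response."""
    similarity = difflib.SequenceMatcher(None, draft, sent).ratio()
    if similarity > 0.95:
        return "good"      # sent nearly unchanged
    if similarity < 0.5:
        return "bad"       # mostly rewritten
    return "neutral"       # lightly edited

signal = implicit_feedback(
    "Hi Sam, your order has shipped.",
    "Hi Sam, your order has shipped.",
)
```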

Best Practices for Production-Grade AI Apps

So far we’ve spoken about different ways of measuring LLM quality. Here are a few more ideas on tactically getting these models into production:

  • Unit testing with a test bank based on your use case: We suggest at least 50-100 test cases so you feel good about quality. This is the perfect time to test any known edge cases! Read more here for tips on how best to implement this.
  • Regression testing to validate updates to prompts once in production: Make sure to run back-testing scripts when changing prompts in production; you don’t want to break any existing behavior! Back-testing is possible if you’re capturing the inputs/outputs of production requests. You can “replay” those inputs through your new prompt to see a before and after.
  • Add edge cases noticed in production to your test bank: Your unit test bank shouldn’t be a static list. If you notice an input that produced an undesirable result in production, you should add it to your test bank! Each time you edit a prompt you should be able to make changes with confidence.
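The back-testing step above can be sketched as a replay loop over captured production traffic. Here run_prompt is a stand-in for your actual model call, and the record format is an assumption:

```python
def run_prompt(prompt_template, inputs):
    # Stand-in for a real LLM call; a template fill keeps the sketch runnable.
    return prompt_template.format(**inputs)

def backtest(prompt_template, captured):
    """Replay captured inputs through a new prompt and report any outputs
    that changed relative to what production originally returned."""
    diffs = []
    for record in captured:
        new_output = run_prompt(prompt_template, record["inputs"])
        if new_output != record["output"]:
            diffs.append({
                "inputs": record["inputs"],
                "before": record["output"],
                "after": new_output,
            })
    return diffs

captured = [  # illustrative production log
    {"inputs": {"name": "Sam"}, "output": "Hello Sam"},
    {"inputs": {"name": "Priya"}, "output": "Hi Priya"},
]
diffs = backtest("Hello {name}", captured)
```

Each entry in diffs is a before/after pair you can review by hand, or score automatically with the evaluation metrics discussed earlier, before shipping the new prompt.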

Need help with testing & evaluation?

Measuring LLM quality is challenging. Unit testing with a large test bank, choosing the right evaluation metric, and regression testing when making changes to prompts in production are all worthwhile strategies. Unfortunately, the tooling and infra needed to do this at scale usually requires significant engineering resources dedicated to building internal tooling.

Vellum’s platform for building production LLM applications aims to solve just that. We provide the tooling layer to experiment with prompts and models, evaluate their quality, and make changes with confidence once in production. If you’re interested, you can request early access here! You can also subscribe to our blog and stay tuned for updates from us.



Akash Sharma, CEO and co-founder at Vellum (YC W23) is enabling developers to easily start, develop and evaluate LLM powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a very unique understanding of the landscape, and is actively distilling his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley Office.
