LLMs have opened up a whole new class of software applications (check out our blog post on Great (and not so great) use cases of LLMs here). However, evaluating the quality of their outputs at scale is hard.
Despite the great versatility demonstrated in side projects and hackathons, we’ve seen companies struggle to put LLMs in production. One of the common reasons is that there’s no framework to evaluate the quality of these models’ outputs. LLMs are inherently probabilistic: the same input can produce different outputs depending on the probabilities assigned by the model when using a temperature greater than 0, and seemingly small changes can result in vastly different outputs.
Having built applications on LLMs for more than 3 years now, we know how important measuring output quality is for creating a great user experience. There needs to be sufficient unit testing before going into production, and regression testing when making changes once in production.
In this post we’ll share our learnings on the best ways to measure LLM quality both before and after deploying to production.
The metrics you choose to measure quality depend on the type of use case. There are various options:
Classification is the easiest use case to measure quantitatively because there’s usually a correct answer! We recommend building up a bank of a few hundred test cases (where each case is something like “given these inputs, I expect this output”). The stats you should be looking for are: accuracy, recall, precision, and F score. To dive deeper into the results, you may also want to see confusion matrices to understand where the model is making mistakes.
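As a rough sketch of what this looks like in practice (assuming a Python test harness with scikit-learn installed; the test bank, labels, and `classify_with_llm` function below are placeholders for your own prompt and model call):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_fscore_support

# Hypothetical test bank: each case maps a known input to an expected label.
test_cases = [
    {"input": "I love this product!", "expected": "positive"},
    {"input": "This was a waste of money.", "expected": "negative"},
    # ...ideally a few hundred more cases
]

expected = [case["expected"] for case in test_cases]
predicted = [classify_with_llm(case["input"]) for case in test_cases]  # your prompt + model call

accuracy = accuracy_score(expected, predicted)
precision, recall, f_score, _ = precision_recall_fscore_support(expected, predicted, average="macro")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f={f_score:.2f}")

# The confusion matrix shows which labels the model confuses with which.
print(confusion_matrix(expected, predicted, labels=["positive", "negative"]))
```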
Some use cases call for output in a machine-readable format, and JSON is a good choice here. There are various kinds of tests you can perform on that output while testing, such as checking that it parses and that the fields you expect are present; a sketch follows below.
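For instance, a minimal validation check (assuming Python; the function name and required keys are hypothetical and depend on your schema) might look like:

```python
import json

def check_structured_output(raw_output: str, required_keys: set) -> list:
    """Return a list of problems found in the model's JSON output (empty list = pass)."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(parsed, dict):
        return ["output is not a JSON object"]
    missing = required_keys - parsed.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []

# Hypothetical usage against a test case's expected schema.
print(check_structured_output('{"name": "Ada", "email": "ada@example.com"}', {"name", "email"}))
```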
The challenge with evaluating creative output is that there is no one correct answer. When you have an example of a good output, semantic similarity can be used as a proxy for quality before productionizing your use case: “How similar in meaning is this response to the target response?” Cross-encoders are well suited to measuring the similarity between the expected and actual output (see the sketch below). If the temperature for these prompts is > 0, make sure to run each model/prompt/test case combination multiple times so you can see the variance in semantic similarity.
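As a minimal sketch (assuming the sentence-transformers library; the specific cross-encoder model name is just an example):

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder trained on semantic textual similarity will work here.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

expected = "Thanks for reaching out! I'd be happy to reschedule our call to next week."
actual = "Of course, let's move our call to a time next week that works for you."

# STS cross-encoders return a similarity score, roughly 0 (unrelated) to 1 (same meaning).
score = model.predict([(expected, actual)])[0]
print(f"semantic similarity: {score:.2f}")
```

Running this across your test bank (and multiple times per case when temperature is > 0) gives you a distribution of similarity scores rather than a single pass/fail.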
One approach we’re excited about and are just in the early stages of testing is using one LLM to evaluate the quality of another. This is definitely experimental, but a good prompt can check for tone, accuracy, language, etc.
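A minimal sketch of what this could look like (assuming the OpenAI Python client; the rubric, model name, and function are placeholders rather than a prescribed approach):

```python
from openai import OpenAI

client = OpenAI()

EVALUATION_PROMPT = """You are grading the output of another AI assistant.
Rate the response below from 1 (poor) to 5 (excellent) on tone, accuracy, and
language quality, then explain your rating in one sentence.

Task given to the assistant: {task}
Assistant's response: {response}"""

def llm_judge(task: str, response: str) -> str:
    # The model name is a placeholder; use whichever evaluator model you prefer.
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": EVALUATION_PROMPT.format(task=task, response=response)}],
    )
    return completion.choices[0].message.content
```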
User feedback is the ultimate source of truth for model quality: if there’s a way for your users to either implicitly or explicitly tell you whether the response is “good” or “bad,” that’s what you should track and improve! High-quality input/output pairs captured in this way can also be used to fine-tune models (you can read more about our thoughts on fine-tuning here).
Explicit user feedback is collected when your users respond to LLM output with something like a 👍 or 👎 in your UI. Asking explicitly may not result in a high enough volume of feedback to measure overall quality, though. If your feedback collection rates are low, we suggest using implicit feedback where possible.
Implicit feedback is based on how users react to the output generated by the LLM. For example, if you generate a first draft of an email for a user and they send it without making edits, that was likely a good response! If they hit regenerate, or rewrite the whole thing, it probably wasn’t. Implicit feedback collection may not be possible for all use cases, but it can be a powerful gauge of quality.
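As a rough illustration (the function, event names, and scoring logic here are entirely hypothetical, just to show how such a signal could be derived for an email-drafting feature):

```python
import difflib

def implicit_feedback_score(draft: str, sent_email: str, hit_regenerate: bool) -> float:
    """Return a score in [0, 1]; higher means the draft was used closer to as-is."""
    if hit_regenerate:
        return 0.0
    # Crude proxy for "sent without edits": similarity between the draft and what was sent.
    return difflib.SequenceMatcher(None, draft, sent_email).ratio()
```

Aggregating a signal like this across users over time gives you a quality trend you can monitor after each prompt or model change.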
So far we’ve spoken about different ways of measuring LLM quality. Here are a few more ideas on tactically getting these models into production:
Measuring LLM quality is challenging. Unit testing with a large test bank, choosing the right evaluation metric, and regression testing when making changes to prompts in production are all worthwhile strategies. Unfortunately, doing this at scale usually requires significant engineering resources dedicated to building internal tooling and infrastructure.
Vellum’s platform for building production LLM applications aims to solve just that. We provide the tooling layer to experiment with prompts and models, evaluate their quality, and make changes with confidence once in production. If you’re interested, you can request early access here! You can also subscribe to our blog and stay tuned for updates from us. We’ll be announcing a new part of the Vellum platform that we call Test Suites soon!