
How to Evaluate Your RAG System?

Learn how to use retrieval and content generation metrics to consistently evaluate and improve your RAG system.


Retrieval-Augmented Generation (RAG) is a powerful technique that improves the quality of LLM outputs by retrieving relevant context from an external vector database.

However, building and evaluating a RAG system can be challenging, especially when it comes to measuring performance.

In this post, we'll explore the most effective metrics for each stage of your RAG pipeline and how to use them to evaluate your whole system.

What is RAG Evaluation?

When evaluating your RAG, you're essentially checking how effectively your system retrieves relevant information from a knowledge base and uses it to produce reliable and precise responses or content.

Running these evaluations is very useful when you're building your first RAG version, but the benefits continue post-development. Running these evals in production will help you understand your system's current performance relative to the improvements you could achieve by modifying your prompts.

It's a never-ending process, and without it there's no way to know whether your RAG system is performing optimally or needs adjustments.

But how do you actually do it?

How to Evaluate Your RAG System?

When evaluating your RAG system, you should pressure test the two most important parts: retrieval and content generation. However, don't overlook the significance of measuring all other aspects of your RAG that contribute to the underlying business logic of your system.

So, what exactly are we evaluating? Let's break it down:

1. Context retrieval

When you evaluate the “Context Retrieval” segment, you're essentially trying to figure out whether you can consistently retrieve the most relevant knowledge from a large corpus of text, given the optimal combination of chunking strategy, embedding model, and search algorithm.
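To make that concrete, here's a minimal sketch of such an experiment loop in Python. Note that `build_index` and `search` are hypothetical placeholders for your own vector-store code; the loop simply grids over chunk sizes and embedding models and records how often a human-labeled relevant chunk shows up in the top results.

```python
# Hypothetical experiment loop over retrieval configurations.
# build_index() and search() are placeholders for your own vector-store setup.

test_queries = [
    # (question, ID of the chunk a human marked as relevant)
    ("What is the refund window?", "policy_chunk_12"),
    ("How do I reset my password?", "help_chunk_03"),
]

for chunk_size in (256, 512, 1024):
    for embed_model in ("text-embedding-3-small", "all-MiniLM-L6-v2"):
        index = build_index(chunk_size=chunk_size, embed_model=embed_model)  # hypothetical
        hits = 0
        for question, relevant_id in test_queries:
            retrieved_ids = search(index, question, top_k=5)  # hypothetical
            hits += relevant_id in retrieved_ids
        hit_rate = hits / len(test_queries)
        print(f"chunk_size={chunk_size} model={embed_model} hit_rate={hit_rate:.2f}")
```

The configuration with the highest hit rate on your test queries is the one to carry forward into generation testing.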

2. Content generation

Evaluating the quality of generated content means running experiments with various prompts and models, using metrics such as faithfulness and relevancy to determine whether, given the most relevant retrieved knowledge, the system produces a reasonable answer.

3. Business logic

The first two are must-haves. However, keep in mind that you should also evaluate the other parts of your AI workflow that matter for your use case and business logic. Intent verification, output length, and rule compliance are some of the many metrics businesses use to evaluate important segments of their RAG pipelines.

To perform some of these evaluations, you'll need human-annotated ground truth data to compare against; alternatively, you can use another LLM to synthetically generate that data, or evaluate your outputs on the spot (GPT-4 is very capable of this task, and LLM-based evaluation is already widely used in the NLP community).
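For the on-the-spot option, a common pattern is "LLM-as-judge": pass the question, retrieved context, and generated answer to a strong model and ask it for a score. Here's a minimal sketch, assuming the official `openai` Python client and an `OPENAI_API_KEY` in your environment:

```python
# Minimal LLM-as-judge sketch using the openai Python client.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, context: str, answer: str) -> str:
    """Ask a GPT-4-class model to grade an answer against retrieved context."""
    prompt = (
        "You are grading a RAG system's answer.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n\n"
        "Rate the answer's faithfulness to the context from 0 to 1, "
        "then briefly justify the score."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content

print(judge_answer(
    "What is the capital of France?",
    "France's capital and largest city is Paris.",
    "The capital of France is Paris.",
))
```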

Now let’s look at the metrics that you should care about when performing these evaluations.

RAG Evaluation Metrics

How best to measure success for RAG systems is an active, fast-moving field of research. But some metrics are proving to be useful for production-grade AI apps.

Context Retrieval Evaluation

To evaluate which retrieval setup produces the best results, you can use the following evaluators:

  1. Context relevance - How relevant is the context to the question?
  2. Context adherence - Are the generated answers based on the retrieved context and nothing else?
  3. Context recall - Does the retrieved context contain the information needed to arrive at the ground-truth answer? (A minimal hit-rate sketch follows below.)
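If you have human-labeled relevant chunks for your test queries, hit-rate-style retrieval metrics are straightforward to compute yourself. A minimal sketch (the chunk IDs here are illustrative):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    return sum(cid in relevant_ids for cid in top_k) / k

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    top_k = retrieved_ids[:k]
    return sum(cid in relevant_ids for cid in top_k) / len(relevant_ids)

# Illustrative example: two of the three relevant chunks appear in the top 5.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 0.67
```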

Content Generation Evaluation

Once you have a good semantic search process in place, you can start testing different prompts and models. Here are some commonly used evaluation metrics:

  1. Answer Relevancy: How relevant is the answer to the question at hand?
    For example, if you ask: “What are the ingredients in a peanut butter and jelly sandwich and how do you make it?" and the answer is "You need peanut butter for a peanut butter and jelly sandwich," this answer would have low relevancy. It only provides part of the needed ingredients and doesn't explain how to make the sandwich.
  2. Faithfulness: How factually accurate is the answer given the context?
    You can mark an answer as faithful if all the claims made in the answer can be inferred from the given context. Faithfulness is often scored on a 0 to 1 scale, where 1 means highly faithful.
  3. Correctness: How accurate is the answer against the ground truth data?
  4. Semantic similarity: How closely does the answer match the ground truth in terms of meaning (semantics)? (See the embedding-based sketch below.)
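Semantic similarity is commonly scored as the cosine similarity between embeddings of the two texts. A minimal sketch, assuming the `sentence-transformers` package; any embedding model would work the same way:

```python
# Cosine similarity between two texts' embeddings.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(text_a: str, text_b: str) -> float:
    """Embed both texts and return their cosine similarity (higher = closer in meaning)."""
    emb_a, emb_b = model.encode([text_a, text_b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print(semantic_similarity(
    "Paris is the capital of France.",
    "France's capital city is Paris.",
))  # close to 1.0
```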

These are just a few examples, but remember, there are many different methods to evaluate your RAG systems based on what's important for your business.

The key is to create a testing process to measure the system's effectiveness and reliability before it goes live. Additionally, once real users interact with it, you can gather their feedback to enhance the system by applying the same testing methods.

To do this best, you need the right tooling - and that's where we can help.

Evaluating your RAG with Vellum

Using Vellum, you can create custom evaluators for every step in your RAG system. On top of that, our Evaluation Reports let you look at absolute and relative performance across summary statistics like mean, median, p10, and p90.

For instance, when creating a customer support chatbot using Vellum's Workflow builder, you can set up all required RAG steps, and the evaluation mechanisms:

  • Search: Initialize the vector database;
  • Retrieval evaluators: Create custom evaluators to check how accurately the chatbot retrieves context;
  • Content evaluators: Test various prompts and models, including any available model, whether proprietary or open-source;
  • Business logic evaluators: Create evaluators for your business logic; build them from scratch or use LLM-based evaluation when needed;
  • Deploy: Launch your chatbot in production and capture implicit and explicit end-user feedback, then use this feedback as baseline data to further evaluate your system;
  • Continuously improve: Regularly run evaluation reports to ensure ongoing trust in your RAG system.

If you’d like to learn more on how to build this with Vellum, book a call here, or reach out to us at support@vellum.ai.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.
