Quick overview
AI-powered assistants, especially chatbots, can fail in unpredictable ways once deployed. The key to long-term success is having a reliable way to catch issues in production and fix them fast.
This guide shows how Vellum’s workflow builder, tracing, and scenario tools make it simple to continuously improve your deployed AI assistants without costly redeploys.
Why this matters
Shipping an AI chatbot to production isn’t the finish line. Real-world users expose edge cases your test suite never covered. If you can’t fix issues quickly, you risk:
- Losing user trust through obvious errors.
- Wasting engineering cycles on redeploys.
- Creating compliance or financial risks if bad outputs reach production systems.
With Vellum, teams have all the tools required to capture, replay, and fix failures in minutes, keeping AI assistants at production quality.
Catching edge cases in Vellum
At Vellum, we built a sample warranty claims chatbot to show how teams can use our platform to build, test, and manage LLM workflows in production.
The bot simulates a customer service agent for an electronics store (Acme Electronics). It helps users:
- Start warranty claims
- Check on existing claims
- Understand what their warranty covers
- Request refunds (when needed)
The flow is powered by a custom intent classifier and several tools wired together using Vellum Workflows. It’s easy to deploy, inspect, and update, without having to change the app code.
But during a live demo, we showed what can happen when you don’t test your workflows carefully: the bot started approving huge refunds without any checks.
Here’s how we caught the problem and fixed it using Vellum.
Quick demo
The AI workflow behind the assistant
Here’s how this AI workflow is wired (a plain-Python sketch of the same control flow follows this list):
- One prompt classifies user intent across four tools:
  - start_claim
  - check_claim
  - understand_warranty
  - issue_refund
- Each tool has a conditional “port” attached to it, so execution only routes there if the function call name matches.
- The tools themselves are basic code blocks (for now), but they could be DB queries, API calls, or any backend logic you want.
- After the tool runs, the output is piped into another prompt that turns the raw function response into a message to the user.
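Purely for illustration, here’s roughly what that control flow amounts to in plain Python. The tool names come from the workflow; everything else is a stand-in for the prompt nodes and code blocks:

```python
# Plain-Python sketch of the workflow's control flow. Tool names match the
# workflow; classify_intent and summarize_for_user stand in for the two
# prompt nodes, and the tool bodies stand in for the code blocks.

def classify_intent(message: str) -> str:
    """Stand-in for the classifier prompt; returns a function-call name."""
    if "refund" in message.lower():
        return "issue_refund"
    return "start_claim"

def start_claim(message: str) -> dict:
    return {"status": "claim_started"}

def check_claim(message: str) -> dict:
    return {"status": "in_review"}

def understand_warranty(message: str) -> dict:
    return {"coverage": "manufacturing defects, 12 months"}

def issue_refund(message: str) -> dict:
    return {"status": "refund_issued"}

TOOLS = {
    "start_claim": start_claim,
    "check_claim": check_claim,
    "understand_warranty": understand_warranty,
    "issue_refund": issue_refund,
}

def summarize_for_user(raw: dict) -> str:
    """Stand-in for the second prompt that rewrites raw tool output."""
    return f"Here's an update on your request: {raw}"

def handle(message: str) -> str:
    intent = classify_intent(message)
    tool = TOOLS.get(intent)  # conditional "port": route only on a name match
    if tool is None:
        return "Sorry, I can only help with warranty claims and refunds."
    return summarize_for_user(tool(message))
```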
In the Vellum Workflow builder, you can see every input and output along the way, and you can easily test individual nodes as you build your workflow.
Take a look at how it was orchestrated in the preview below:
The problem: Wrong function call
Now let’s say customers were chatting with the agent, saying things like:
“I broke my headphones.”
No problem: our agent classifies this as claim creation, asks for product info and order number, and files a warranty claim.
But then someone tried: “Give me a refund now.”
And the bot said: “Sure. Here’s $1,500.”
So in this case, the intent classifier was too eager. It saw “refund” and jumped straight to calling the issue_refund tool wired to our Intent Classifier, without confirming anything.
In the demo, this was just a hardcoded return. But if this had been a real system with access to actual backend APIs or payment processors, it would’ve been dangerous.
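For concreteness, here’s roughly what the demo stub looked like next to what a production version might do; the payments client on the commented lines is hypothetical:

```python
# The demo's issue_refund tool was just a hardcoded return, so no money moved:
def issue_refund(message: str) -> dict:
    return {"status": "refund_issued", "amount_usd": 1500}

# In a real system, the same node might instead call a payments API, e.g.
# (hypothetical client and method names):
#   payments.refunds.create(order_id=order_id, amount_usd=1500)
# At that point, the eager classifier alone decides when money moves.
```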
So now the question is: how do you fix something like this fast, without dragging engineering back into a full re-deploy?
The solution
The best thing about using Vellum to define your AI workflows is that you get solid infrastructure for continuously improving your system in production. Here are a few quick steps for fixing a problem in production and reliably improving performance.
Step 1: Capture what went wrong
Because the agent was built in Vellum, we could trace the exact execution path using the Vellum Observability tools:
- The tool calls
- The inputs and outputs
- The full stack of prompts, responses, and decisions
We opened the execution log and saw that the workflow jumped straight to the refund tool, which is not what we want; our assistant shouldn’t hand out refunds that easily:
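Illustratively, here’s the kind of record the execution log surfaces; the field names are ours for the sketch, not Vellum’s actual schema:

```python
# Illustrative only: the shape of information the execution log surfaces.
# Field names are stand-ins, not Vellum's actual log schema.
trace = {
    "execution_id": "exec_123",  # hypothetical ID
    "input": {"message": "Give me a refund now."},
    "steps": [
        {"node": "Intent Classifier", "output": {"function_call": "issue_refund"}},
        {"node": "issue_refund", "output": {"status": "refund_issued", "amount_usd": 1500}},
        {"node": "Final Response Prompt", "output": "Sure. Here's $1,500."},
    ],
}
# Read top to bottom, the problem is obvious: nothing between the classifier
# and the refund tool ever asked for proof of approval.
```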

Step 2: Capture as a scenario
Once you spot an undesirable execution like this, you can save it as a “Scenario” right from the Vellum execution log. This captures the exact situation you just saw in production as a test case you can run against your workflow:
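Conceptually, a Scenario is the captured input from that execution, replayable against any later version of the workflow. A minimal sketch (the record shape is ours, not Vellum’s), reusing the handle function from the earlier routing sketch:

```python
# Illustrative only: a Scenario as a captured production input you can replay.
scenario = {
    "name": "eager-refund",  # hypothetical label
    "inputs": {"message": "Give me a refund now."},
    "expectation": "must not route to issue_refund without proof of approval",
}

def replay(scenario: dict, workflow) -> str:
    """Re-run the exact production input against the current workflow."""
    return workflow(scenario["inputs"]["message"])

# After any prompt or logic change, replay the failure and inspect the result:
print(replay(scenario, handle))
```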

Step 3: Fix the issue
Next, we pulled up the original classifier prompt inside Vellum’s sandbox and made a small change:
“Do not issue a refund unless there is proof of approval.”
No code needed. No SDKs. Just an update to the system prompt.
Then we re-ran the scenario, and this time, the refund wasn’t triggered. That one line stopped the bot from auto-approving money requests. This gave us confidence that the fix worked, based on the exact interaction that failed in production.
We made the change in the Visual Builder, but if you’d rather have your engineers make it for you, they can use the “SDK preview” that powers this workflow, make the change in code, and push it back up:
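Illustratively, the code-side version of the fix is a one-line addition to the classifier’s system prompt. The names below are stand-ins; the SDK preview generates the exact code for your workflow:

```python
# Hypothetical excerpt of the SDK-preview code for the classifier node;
# names here are illustrative, the app generates the real definition.
CLASSIFIER_SYSTEM_PROMPT = """\
You are a support agent for Acme Electronics. Classify the user's message
and call exactly one of: start_claim, check_claim, understand_warranty,
issue_refund.

Do not issue a refund unless there is proof of approval.
"""
# The last line is the entire fix; push the updated module back to Vellum
# and re-run the saved Scenario to confirm the refund is no longer triggered.
```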

Step 4: Push the fix live
From there, it was just a click to deploy the new version. Because Vellum hosts the workflow endpoint:
- We didn’t need to rebuild or redeploy the app
- We didn’t need to coordinate with backend engineers
- The bot immediately started using the updated logic
This is a big deal. You get to ship changes in minutes, not days, and it’s why many of our customers (RelyHealth, Woflow) are able to move fast.
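Because Vellum hosts the endpoint, the application’s call site never changes between versions. Here’s a rough sketch of such a call; the URL, header, and payload fields are illustrative, so check Vellum’s API docs for the exact request shape:

```python
import requests

# Illustrative request to a Vellum-hosted workflow deployment. The URL,
# header, and field names are stand-ins, not Vellum's exact API; the point
# is that this client code is version-agnostic.
resp = requests.post(
    "https://predict.vellum.ai/v1/execute-workflow",  # illustrative URL
    headers={"X-API-KEY": "YOUR_VELLUM_API_KEY"},
    json={
        "workflow_deployment_name": "warranty-claims-bot",  # hypothetical name
        "inputs": [
            {"name": "message", "type": "STRING", "value": "Give me a refund now."}
        ],
    },
)
print(resp.json())
# Deploying a new workflow version changes what runs behind this URL;
# the app never needs a rebuild or redeploy.
```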
Try Vellum today
Vellum Workflow Builder: link
Vellum SDK: link
Extra Resources
- 2025 Guide to AI Agent Workflows →
- Understanding Agentic Behavior in Production →
- How the best teams ship AI solutions →
- How Drata built an enterprise-grade AI solution →
- How Revamp Reliably Runs 15M+ LLM Executions →
FAQs
What does “continuous improvement” mean for an AI assistant?
It means monitoring real-world interactions, identifying failures, and applying quick fixes without rebuilding or redeploying the app.
How do I capture issues in production?
With Vellum, every execution is logged with inputs, outputs, and tool calls. You can save misfires as Scenarios for replay.
What is a Scenario in Vellum?
A saved execution from production that can be re-run against workflow changes to confirm a fix.
Can I update my AI assistant without redeploying?
Yes. Because workflows are hosted by Vellum, updates take effect instantly once pushed live.
How do I prevent risky outputs like unauthorized refunds?
By refining prompts, adding guardrails, and using Scenarios to validate that fixes actually prevent the issue.
Does fixing workflows require engineering resources?
Not necessarily. Non-technical users can fix workflows in the Visual Builder. Engineers can also edit via the SDK if preferred.
Can I A/B test different fixes?
Yes. Vellum supports workflow versioning so you can compare performance before rolling out changes broadly.
How do I know if my fix worked?
Re-run the captured Scenario. If the output changes in the desired way, you know the fix is validated.
What if my AI assistant integrates with APIs or databases?
Scenarios still replay the calls, but you can mock external dependencies to safely test without hitting production systems.
What can I build in Vellum other than chatbots?
Any LLM-powered workflow (agents, pipelines, assistants) can use the same tools for continuous improvement.