May 13, 2025
Guides
How to continuously improve your AI Assistant using Vellum
5 min

Capture edge cases in production and fix them in a couple of minutes without redeploying your application.

At Vellum, we built a sample warranty claims chatbot to show how teams can use our platform to build, test, and manage LLM workflows in production.

The bot simulates a customer service agent for an electronics store (Acme Electronics). It helps users:

  • Start warranty claims
  • Check on existing claims
  • Understand what their warranty covers
  • And (when needed) request refunds

The flow is powered by a custom intent classifier and several tools wired together using Vellum Workflows. It’s easy to deploy, inspect, and update without changing the app code.

But during a live demo, we showed what can happen when you don’t test your workflows carefully: the bot started approving huge refunds without any checks.

Here’s how we caught the problem and fixed it using Vellum.

Quick Demo

The AI workflow behind the assistant

Here’s how this AI workflow is wired:

  • One prompt classifies user intent across four tools:
    • start_claim
    • check_claim
    • understand_warranty
    • issue_refund
  • Each tool has a conditional “port” attached to it, so execution only routes there if the function call name matches.
  • The tools themselves are basic code blocks (for now), but they could be DB queries, API calls, or any backend logic you want.
  • After the tool runs, the output is piped into another prompt that turns the raw function response into a message to the user.
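
To make the wiring concrete, here’s a minimal, self-contained sketch of that routing pattern in plain Python (not the Vellum SDK). The two prompts are stubbed out as ordinary functions, and the tool implementations are hypothetical placeholders:

from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    arguments: dict = field(default_factory=dict)

def classify_intent(message: str) -> Intent:
    # Stand-in for prompt #1: the intent classifier that emits a function call.
    if "refund" in message.lower():
        return Intent("issue_refund", {"amount": 1500.0})
    return Intent("start_claim", {"product": "headphones"})

# Stand-ins for the tool code blocks (in practice: DB queries, API calls, backend logic).
def start_claim(product: str) -> dict:
    return {"claim_id": "C-001", "product": product}

def check_claim(claim_id: str) -> dict:
    return {"claim_id": claim_id, "status": "in review"}

def understand_warranty(product: str) -> dict:
    return {"product": product, "coverage": "12-month manufacturer warranty"}

def issue_refund(amount: float) -> dict:
    return {"status": "approved", "amount": amount}

TOOLS = {
    "start_claim": start_claim,
    "check_claim": check_claim,
    "understand_warranty": understand_warranty,
    "issue_refund": issue_refund,
}

def summarize_for_user(raw_result: dict) -> str:
    # Stand-in for prompt #2, which rewrites the raw tool output as a user-facing reply.
    return f"Here's what I did: {raw_result}"

def handle_message(message: str) -> str:
    intent = classify_intent(message)
    tool = TOOLS.get(intent.name)  # the conditional "port": route only on a name match
    if tool is None:
        return "Sorry, I can't help with that yet."
    return summarize_for_user(tool(**intent.arguments))

print(handle_message("I broke my headphones."))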

In the Vellum Workflow builder, you can see every input and output along the way, and you can easily test individual nodes as you build your workflow.

Take a look at how it was orchestrated in the preview below:


The Problem: Wrong function call

Now let’s say customers were chatting with the agent and saying things like:

“I broke my headphones.”

No problem, our agent classifies this as a claim creation, asks for product info and order number, and files a warranty claim.

But then someone tried: “Give me a refund now.”

And the bot said: “Sure. Here’s $1,500.”

In this case, the intent classifier was too eager. It saw “refund” and jumped straight to calling the issue_refund tool defined in our “Intent Classifier” prompt, without confirming anything.

In the demo, this was just a hardcoded return. But if this had been a real system with access to actual backend APIs or payment processors, it would’ve been dangerous.
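
To illustrate the gap, here’s a hedged sketch of the difference between the demo stub and what the same tool could front in a real system; the PaymentsAPI class is a hypothetical stand-in, not a real library:

class PaymentsAPI:
    """Hypothetical stand-in for a real payment processor client (not a real library)."""
    def refund(self, order_id: str, amount: float) -> dict:
        # Imagine real money moving here.
        return {"status": "refunded", "order_id": order_id, "amount": amount}

def issue_refund_demo(amount: float) -> dict:
    # What the demo tool actually did: a hardcoded, harmless return.
    return {"status": "approved", "amount": amount}

def issue_refund_production(order_id: str, amount: float, payments: PaymentsAPI) -> dict:
    # Same routing decision, real consequences: nothing upstream has verified approval.
    return payments.refund(order_id, amount)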

So now the question is: how do you fix something like this fast, without dragging engineering back into a full re-deploy?

The Solution

The best thing about defining your AI workflows in Vellum is that you get solid infrastructure for continuously improving your system in production. Here are the steps we took to fix this problem in production and reliably improve performance.

Step 1: Capture what went wrong

Because the agent was built in Vellum, we could trace the exact execution path using the Vellum Observability tools:

  • The tool calls
  • The inputs and outputs
  • The full stack of prompts, responses, and decisions

We opened the execution log and saw that the workflow jumped straight to the refund tool, which isn’t what we want; the assistant shouldn’t give refunds that easily:

Preview of the tracing view in Vellum where we can preview all executions for a given workflow
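
As a rough mental model, each execution record captures something like the following (hypothetical field names for illustration; the real record lives in Vellum’s Observability views):

from dataclasses import dataclass

@dataclass
class ExecutionTrace:
    inputs: dict            # what the user and upstream nodes sent in
    outputs: dict           # what each node returned
    tool_calls: list[str]   # e.g. ["issue_refund"] for the bad run
    prompts: list[str]      # the full stack of prompts, responses, and decisions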

Step 2: Capture as a scenario

Once you spot an undesirable execution like this in the Vellum execution log, you can save it as a “Scenario”. This captures the exact situation you just saw in production as a scenario you can run and test against your workflow:

Saving a scenario from production

Step 3: Fix the issue

Next, we pulled up the original classifier prompt inside Vellum’s sandbox and made a small change:

“Do not issue a refund unless there is proof of approval.”

No code needed. No SDKs. Just an update to the system prompt.
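
For context, here’s roughly where that line sits in the classifier’s system prompt, shown as a string purely for illustration; everything except the quoted guard line is assumed wording, not the exact prompt:

CLASSIFIER_SYSTEM_PROMPT = """\
You are a customer support assistant for Acme Electronics.
Classify each user message and call exactly one of these tools:
start_claim, check_claim, understand_warranty, issue_refund.
Do not issue a refund unless there is proof of approval.
"""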

Then we re-ran the scenario, and this time, the refund wasn’t triggered. That one line stopped the bot from auto-approving money requests. This gave us confidence that the fix worked, based on the exact interaction that failed in production.

We made the change in the visual builder, but if you’d rather have your engineers make it for you, they can use the “SDK preview” that powers this workflow, make the change in code, and push it back up:

The UI builder and the SDK representation of the workflows

Step 4: Push the Fix Live

From there, it was just a click to deploy the new version. Because Vellum hosts the workflow endpoint:

  • We didn’t need to rebuild or redeploy the app
  • We didn’t need to coordinate with backend engineers
  • The bot immediately started using the updated logic
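
Because the app only ever calls the hosted endpoint, a new release changes behavior behind the same URL. Here’s a hedged sketch of what that call can look like from the application side; the endpoint path, payload shape, and deployment name are assumptions for illustration, not exact API documentation:

import requests  # assumes the requests package is installed

def ask_warranty_bot(message: str, api_key: str) -> dict:
    resp = requests.post(
        "https://predict.vellum.ai/v1/execute-workflow",  # assumed endpoint path
        headers={"X-API-KEY": api_key},
        json={
            "workflow_deployment_name": "warranty-claims-bot",  # hypothetical deployment name
            "inputs": [
                {"name": "user_message", "type": "STRING", "value": message},  # assumed input shape
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()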

This is a big deal. You get to ship changes in minutes, not days, which is why many of our customers (RelyHealth, Woflow) are able to move fast.

Try Vellum today

Vellum Workflow Builder: link

Vellum SDK: link

Warranty bot: link

ABOUT THE AUTHORS

David Vargas
Full Stack Founding Engineer

A Full-Stack Founding Engineer at Vellum, David Vargas is an MIT graduate (2017) with experience at a Series C startup and as an independent open-source engineer. He built tools for thought through his company, SamePage, and now focuses on shaping the next era of AI-driven tools for thought at Vellum.

Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

