At Vellum, we built a sample warranty claims chatbot to show how teams can use our platform to build, test, and manage LLM workflows in production.
The bot simulates a customer service agent for an electronics store (Acme Electronics). It helps users:
- Start warranty claims
- Check on existing claims
- Understand what their warranty covers
- And (when needed) request refunds
The flow is powered by a custom intent classifier and several tools wired together using Vellum Workflows. It’s easy to deploy, inspect, and update without changing any app code.
But during a live demo, we showed what can happen when you don’t test your workflows carefully: the bot started approving huge refunds without any checks.
Here’s how we caught the problem and fixed it using Vellum.
Quick Demo
The AI workflow behind the assistant
Here’s how this AI workflow is wired:
- One prompt classifies user intent across four tools:
  - start_claim
  - check_claim
  - understand_warranty
  - issue_refund
- Each tool has a conditional “port” attached to it, so execution only routes there if the function call name matches.
- The tools themselves are basic code blocks (for now), but they could be DB queries, API calls, or any backend logic you want.
- After the tool runs, the output is piped into another prompt that turns the raw function response into a message to the user.
In the Vellum Workflow builder, you can see every input and output along the way, and you can easily test individual nodes as you build your workflow.
Take a look at how it was orchestrated in the preview below:
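Preview aside, the routing described above can be sketched in plain Python. This is purely illustrative, not Vellum SDK code: the four tool names come from the workflow, while the handler bodies and the shape of the function-call dict are assumptions for the sake of the example.

```python
# Plain-Python sketch of the workflow's routing logic -- illustrative only.
# The tool names match the workflow; the handler bodies are stand-ins.

def start_claim(args): return {"status": "claim_started", **args}
def check_claim(args): return {"status": "claim_found", **args}
def understand_warranty(args): return {"coverage": "12 months, defects only"}
def issue_refund(args): return {"status": "refund_issued", **args}

# Each conditional "port" only fires when the function-call name matches.
PORTS = {
    "start_claim": start_claim,
    "check_claim": check_claim,
    "understand_warranty": understand_warranty,
    "issue_refund": issue_refund,
}

def route(function_call: dict) -> dict:
    """Dispatch the classifier's function call to the matching tool."""
    name = function_call["name"]
    if name not in PORTS:
        raise ValueError(f"No port matches function call: {name}")
    # In the real workflow, this raw tool output is then piped into a
    # second prompt that turns it into a user-facing message.
    return PORTS[name](function_call.get("arguments", {}))

print(route({"name": "check_claim", "arguments": {"claim_id": "C-123"}}))
# → {'status': 'claim_found', 'claim_id': 'C-123'}
```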
The Problem: Wrong function call
Now let’s say that our customers were chatting with the agent saying things like:
“I broke my headphones.”
No problem, our agent classifies this as a claim creation, asks for product info and order number, and files a warranty claim.
But then someone tried: “Give me a refund now.”
And the bot said: “Sure. Here’s $1,500.”
In this case, the intent classifier was too eager: it saw “refund” and jumped straight to calling the issue_refund tool wired to our “Intent Classifier” without confirming anything.
In the demo, this was just a hardcoded return. But if this had been a real system with access to actual backend APIs or payment processors, it would’ve been dangerous.
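To make the risk concrete, here is a hedged sketch of the failure mode. The demo tool was just a hardcoded return (the 1500 figure and argument names here are illustrative); the point is that nothing between the classifier and the tool checks whether the refund should happen.

```python
# Illustrative sketch of the demo's unguarded refund tool.
# In the demo this was a hardcoded return; in a real system the same
# unguarded call path could hit a live payment API, so ANY message the
# classifier maps to "issue_refund" would move real money.

def issue_refund_demo(args: dict) -> dict:
    # No approval check, no confirmation step -- it just pays out.
    return {"status": "refund_issued", "amount": args.get("amount", 1500)}

print(issue_refund_demo({}))  # pays out even with no supporting details
```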
So now the question is: how do you fix something like this fast, without dragging engineering back into a full re-deploy?
The Solution
The best thing about defining your AI workflows in Vellum is that you get solid infrastructure for continuously improving your system in production. Here are a few quick steps you can take to fix a problem in production and reliably improve performance.
Step 1: Capture what went wrong
Because the agent was built in Vellum, we could trace the exact execution path using the Vellum Observability tools:
- The tool calls
- The inputs and outputs
- The full stack of prompts, responses, and decisions
We opened the execution log and saw that the workflow jumped straight to the refund tool, which is exactly what we don’t want our assistant to do:

Step 2: Capture as a scenario
Once you spot an undesirable execution like this, you can save it as a “Scenario” directly from the Vellum Execution log. This captures the exact situation you just saw in production as a scenario you can replay and test against your workflow:
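Conceptually, a saved scenario is a regression test: the exact production input, frozen, plus an expectation you can replay against any new version of the workflow. The sketch below shows that idea in plain Python; the field names and the stand-in workflows are hypothetical, not Vellum’s API.

```python
# Conceptual sketch of a saved "Scenario" as a regression test.
# Field names and stand-in workflows are illustrative, not Vellum's API.

scenario = {
    "name": "eager-refund-regression",
    "input": {"chat_history": [{"role": "user", "content": "Give me a refund now."}]},
    # What we expect AFTER the fix: no refund tool call.
    "expectation": lambda result: result.get("tool_called") != "issue_refund",
}

def run_scenario(workflow, scenario) -> bool:
    """Replay the frozen production input against a workflow version."""
    result = workflow(scenario["input"])
    return scenario["expectation"](result)

# Stand-ins for the buggy and fixed workflow versions:
buggy_workflow = lambda _input: {"tool_called": "issue_refund"}
fixed_workflow = lambda _input: {"tool_called": None,
                                 "reply": "I can help, but refunds need approval first."}

assert not run_scenario(buggy_workflow, scenario)  # reproduces the failure
assert run_scenario(fixed_workflow, scenario)      # confirms the fix
```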

Step 3: Fix the issue
Next, we pulled up the original classifier prompt inside Vellum’s sandbox and made a small change:
“Do not issue a refund unless there is proof of approval.”
No code needed. No SDKs. Just an update to the system prompt.
Then we re-ran the scenario, and this time, the refund wasn’t triggered. That one line stopped the bot from auto-approving money requests. This gave us confidence that the fix worked, based on the exact interaction that failed in production.
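As defense in depth, you could also guard the tool itself, so that even a misrouted function call can’t pay out. This is a sketch of that idea, not part of the demo: the `approval_id` field is a hypothetical example of “proof of approval.”

```python
# Defense-in-depth sketch: besides the prompt instruction, the refund
# tool itself refuses to run without proof of approval.
# The `approval_id` field is hypothetical, not part of the demo.

def issue_refund(args: dict) -> dict:
    approval_id = args.get("approval_id")
    if not approval_id:
        # Surface a refusal the downstream prompt can turn into a
        # polite message, instead of moving money.
        return {"status": "refused", "reason": "missing proof of approval"}
    return {"status": "refund_issued", "approval_id": approval_id}

print(issue_refund({}))                         # refused: no approval
print(issue_refund({"approval_id": "APR-42"}))  # refund goes through
```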
We made the change in the visual builder, but if you’d rather have your engineers make it for you, they can use the “SDK preview” that powers this Workflow, edit the code, and push it back up:

Step 4: Push the Fix Live
From there, it was just a click to deploy the new version. Because Vellum hosts the workflow endpoint:
- We didn’t need to rebuild or redeploy the app
- We didn’t need to coordinate with backend engineers
- The bot immediately started using the updated logic
This is a big deal. You get to ship changes in minutes, not days, which is why many of our customers, like RelyHealth and Woflow, are able to move fast.
Try Vellum today
Vellum Workflow Builder: link
Vellum SDK: link
Warranty bot: link