Understanding your agent’s behavior in production
7 min

You can’t improve what you can’t see, so start tracking every decision your agent makes.

Anita Kirkovska
Aug 5, 2025
Guides

With traditional software, observability solutions usually help you catch crashes, errors, or slow responses. Because the logic is mostly deterministic, it's typically easy to trace a problem back to its source.

With AI agents, however, you're trying to understand the decisions behind each action, not just the outcomes. That means asking questions like:

  • Why did the model respond that way?
  • What context was it using?
  • Which tools did it call, and why?
  • Did it follow the right path through the workflow?
  • Was the final output grounded in reality, or did it hallucinate?
  • Is the agent improving over time, or getting worse?

These are not questions logs and metrics alone can answer.

You need to capture the full trace of a given agent run: its prompts, retrieved docs, tool invocations, latency, cost, sub-agent executions, outputs, and user feedback, among other things.

This post shares the core concepts around AI observability that we’ve learned from working with hundreds of customers. If you're trying to understand your own AI agent, I hope these insights help you debug and optimize faster.

AI agents break quietly

With AI apps, every 1% matters

This is the fundamental challenge of AI observability: every decision is probabilistic, and these probabilities compound. You're dealing with branching probability trees where each node influences every subsequent decision. A slight misinterpretation at step one becomes a wrong retrieval at step two becomes a hallucinated policy at step three:

Every % improvement compounds

Every 1% improvement matters. As the image above shows, if you have an agent with 10 steps and each step is 99% accurate, you're looking at roughly 90% overall accuracy. If each step is only 97% accurate, you've dropped to about 74% overall.
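A quick back-of-the-envelope calculation makes the compounding concrete (the step count and per-step accuracies here are just illustrative):

```python
# Overall accuracy of an agent where every step must succeed
# for the run as a whole to succeed.
steps = 10

for per_step_accuracy in (0.99, 0.97, 0.95):
    overall = per_step_accuracy ** steps
    print(f"{per_step_accuracy:.0%} per step -> {overall:.0%} end-to-end")

# 99% per step -> 90% end-to-end
# 97% per step -> 74% end-to-end
# 95% per step -> 60% end-to-end
```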

To optimize this agent, you need to look at everything: every input, every tool invocation, every intermediate decision, and everything else that shaped the result.

You can’t control what you can’t see

If we go deeper, we'll see that each decision point has its own hidden variables. Each step can have a different model setup: temperature settings, context window, top-k/top-p parameters, function-calling outcomes, context reranking, and the list goes on.

Change any of these, even slightly, and the entire downstream behavior shifts.
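As an illustration, here's what the hidden configuration behind a single step might look like. The field names below are hypothetical, but each one is a knob that silently shifts downstream behavior:

```python
from dataclasses import dataclass, field

@dataclass
class StepConfig:
    """Hidden variables behind one step of an agent (illustrative names)."""
    model: str = "gpt-4o"
    temperature: float = 0.2            # sampling randomness
    top_p: float = 0.95                 # nucleus sampling cutoff
    top_k: int | None = None            # top-k sampling (provider-dependent)
    max_context_tokens: int = 8192      # how much history/RAG context fits
    tools: list[str] = field(default_factory=lambda: ["get_weather"])
    reranker: str | None = "cohere-rerank"  # context reranking step, if any

# Logging this config alongside every execution lets you tell
# "the prompt changed" apart from "someone bumped temperature to 0.9".
```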

More agents, more decisions, more chaos

As you add complexity, especially with multiple agents making decisions from different inputs, the chances of something going off-track increase quickly.

Without observability, it’s almost impossible to know why an agent made a specific decision, where things started to drift, or how one subtle change created downstream failures.

What to track when your agent goes live

The most effective product engineering teams set up systems that help them understand their AI agents' behavior, execution paths, and errors. Below we go through the key components you'll need to build (or buy) to observe your AI system end to end.

Execution-level reports

When you’re building AI agents, you’ll probably need to rethink what you want to measure. Yes, you need the basics: request latency, throughput, error rates. But you also need AI-specific metrics: token usage, cost, tool invocation success, context relevance, among others.

As you start to build more complex agents, you should probably track things like retry rates, user feedback, decision branches, context utilization, and more.

Having a report like the one below makes it easier to spot issues fast. You can quickly see if something failed, and why it failed:

| Execution ID | Model | Input Tokens | Output Tokens | Tool Used | Tool Success | TTFT (s) | Total Latency (s) | Retries | Cost ($) | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| exec_001 | gpt-4 | 108 | 47 | get_weather | ✅ Yes | 0.7 | 2.5 | 0 | 0.00585 | ✅ Success |
| exec_002 | claude 4.1 | 92 | 42 | search_flights | ❌ No | 1.1 | 3.3 | 1 | 0.00462 | ❌ Failed |
| exec_003 | gpt-oss | 130 | 60 | None | N/A | 0.5 | 1.5 | 0 | 0.00690 | ✅ Success |
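A minimal way to start is to emit one structured record per execution, combining the classic metrics with the AI-specific ones. This sketch assumes hypothetical field names that roughly mirror the table above:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class ExecutionRecord:
    execution_id: str
    model: str
    input_tokens: int
    output_tokens: int
    tool_used: str | None
    tool_success: bool | None
    ttft_s: float            # time to first token, in seconds
    total_latency_s: float
    retries: int
    cost_usd: float
    status: str              # "success" or "failed"

def log_execution(record: ExecutionRecord) -> None:
    # In practice this goes to your logging/metrics pipeline;
    # printing JSON keeps the sketch self-contained.
    print(json.dumps(asdict(record)))

log_execution(ExecutionRecord(
    execution_id="exec_001", model="gpt-4", input_tokens=108, output_tokens=47,
    tool_used="get_weather", tool_success=True, ttft_s=0.7,
    total_latency_s=2.5, retries=0, cost_usd=0.00585, status="success",
))
```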

Full trace visibility

Now, execution-level reports are the first place to check when something goes wrong.

But to really understand what happened, you'll need detailed logs for each step of the agent workflow. These logs should capture every input and response, along with the hidden variables behind each step. If the agent called tools or other sub-agents, those should be recorded too.

This data becomes valuable when tracking down why your model chose one path over another.

For example, here’s a simple tracing table that shows every action the agent took including some nested sub-agents it called:

| Step | What Happened | What Was Called / Used | Tokens (in/out) | Latency (ms) | Est. Cost ($) |
|---|---|---|---|---|---|
| 1 | User Input: "What’s the weather like in London?" | Appended to messages | 9 / 0 | ~0 | 0.00027 |
| 2 | RAG Context Injected: "User is traveling to London..." | get_rag_context(user_input) | 10 / 0 | 50 | 0.00030 |
| 3 | Model Invocation (decides to call tool) | ChatCompletion.create(...) | 50 / 12 | 300 | 0.00186 |
| └─ 3.1 | Tool Call Requested: get_weather(location="London") | Called from model | — | — | — |
| └─ 3.2 | Tool Execution: returns "72°F and sunny in London" | get_weather("London") | 0 / 9 | 100 | 0.00054 |
| 4 | Final Model Completion (sees tool result) | 2nd ChatCompletion.create(...) | 60 / 20 | 400 | 0.00300 |
| 5 | Output to user: "It looks like it’s currently 72°F..." | Final model response | — | — | — |

Basically, each of these traces should capture the full graph of operations. When your agent decides to call a weather API, then uses that data to query a database, then synthesizes the results, that entire flow needs to be visible as a single trace. If your agent evaluates three possible tools and chooses one, show all three evaluations in the trace, not just the chosen path.
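If you're not ready to adopt a full tracing stack (e.g. something OpenTelemetry-compatible), even a hand-rolled span recorder is enough to capture that nested graph of operations. A minimal sketch, with illustrative names:

```python
import time
import uuid
from contextlib import contextmanager

SPANS: list[dict] = []  # in-memory sink; swap in your trace backend

@contextmanager
def span(name: str, parent_id: str | None = None, **attributes):
    """Record one step of the agent: a model call, tool call, sub-agent, etc."""
    record = {
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
        "attributes": dict(attributes),
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["latency_ms"] = round((time.time() - record["start"]) * 1000, 1)
        SPANS.append(record)

# Usage: nest spans so the full graph of one run can be reconstructed later.
with span("agent_run", user_input="What's the weather like in London?") as run:
    with span("model_invocation", parent_id=run["span_id"], model="gpt-4") as m:
        m["attributes"].update(tokens_in=50, tokens_out=12)
    with span("tool_call", parent_id=run["span_id"], tool="get_weather") as t:
        t["attributes"]["result"] = "72°F and sunny in London"
```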

Visual execution tracing

The previous two observability tables show you when something goes wrong and why a model chose a given path. But as you build more complex agents, parsing these logs in a table or in your IDE gets tough fast.

There is a growing need from both engineers and other stakeholders (e.g. PMs, legal, management) to debug these agents in a visual graph, where you can see the decision tree the model executed, which tools were considered, why specific paths were taken, and how confidence scores evolved through the conversation.
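Even before you have a full visual graph UI, you can get a long way by rendering recorded spans as an indented tree. A rough sketch, assuming spans shaped like the ones in the tracing example above:

```python
def render_trace(spans: list[dict], parent_id: str | None = None, depth: int = 0) -> None:
    """Print a trace as an indented tree, with children under their parents."""
    for s in spans:
        if s["parent_id"] == parent_id:
            latency = s.get("latency_ms", "?")
            print("  " * depth + f"└─ {s['name']} ({latency} ms) {s['attributes']}")
            render_trace(spans, parent_id=s["span_id"], depth=depth + 1)

# render_trace(SPANS) prints something like:
# └─ agent_run (410.3 ms) {'user_input': "What's the weather like in London?"}
#   └─ model_invocation (300.1 ms) {'model': 'gpt-4', 'tokens_in': 50, 'tokens_out': 12}
#   └─ tool_call (100.2 ms) {'tool': 'get_weather', 'result': '72°F and sunny in London'}
```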

Here's how a visual trace might look for you:

User-in-the-loop

Finally, no matter how much observability data you collect, today's AI agents still need human input. And the best input you can give them comes from your users.

To integrate user feedback into your setup, start by adding a simple feature that asks users whether the response was helpful and invites them to explain why or why not. The feedback can be explicit (a thumbs up or down) or implicit (e.g. how long the user waited for a ticket to be resolved). You can then feed it back into your evaluation datasets or trigger re-runs for failed cases, so you can easily evaluate what happened for a particular user.
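Whatever the UI looks like, the key is tying each piece of feedback back to a specific execution ID so it lands next to the trace it describes. A minimal sketch of the kind of record you might store (field names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FeedbackEvent:
    execution_id: str               # links the feedback to the exact trace
    explicit_score: int | None      # e.g. +1 thumbs up, -1 thumbs down
    implicit_signal: float | None   # e.g. minutes until the ticket was resolved
    comment: str | None
    created_at: str

feedback = FeedbackEvent(
    execution_id="exec_002",
    explicit_score=-1,
    implicit_signal=42.0,
    comment="The suggested flight didn't exist.",
    created_at=datetime.now(timezone.utc).isoformat(),
)
# Append events like this to your evaluation dataset, or use them to
# trigger a re-run of the failing execution for closer inspection.
```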

Loud alerts for quiet failures

It’s helpful to regularly review your agent’s behavior, but when something goes wrong in production, you need a system that tells you immediately, before your users report it.

Within this system you can track things like the following (a minimal alerting sketch follows the list):

  • Retry spikes: How often are model calls or tools failing and needing to run again?
  • Latency outliers: P99 or P100 latency far above your P50 is usually a sign something’s off.
  • Cost anomalies: A few unexpectedly long completions can skew your budget fast.
  • Tool failure rates: Are certain tools returning errors more frequently?
  • Empty or truncated outputs: Often a sign that context limits were hit or generation failed silently.
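These checks can start out as simple thresholds over your execution records. A rough sketch, reusing the hypothetical ExecutionRecord from the execution-report example above (the thresholds are placeholders you'd tune for your own traffic):

```python
import statistics

def check_alerts(records: list) -> list[str]:
    """Return human-readable alerts for the most common quiet failures.

    `records` are ExecutionRecord objects from the earlier sketch.
    """
    alerts: list[str] = []
    if not records:
        return alerts

    retry_rate = sum(r.retries > 0 for r in records) / len(records)
    if retry_rate > 0.10:
        alerts.append(f"Retry spike: {retry_rate:.0%} of executions needed a retry")

    latencies = sorted(r.total_latency_s for r in records)
    p50 = statistics.median(latencies)
    p99 = latencies[int(len(latencies) * 0.99)]  # crude P99; effectively the max on small samples
    if p50 > 0 and p99 > 5 * p50:
        alerts.append(f"Latency outlier: P99 {p99:.1f}s vs P50 {p50:.1f}s")

    tool_calls = [r for r in records if r.tool_used]
    if tool_calls:
        tool_failure_rate = sum(not r.tool_success for r in tool_calls) / len(tool_calls)
        if tool_failure_rate > 0.05:
            alerts.append(f"Tool failures: {tool_failure_rate:.0%} of tool calls failed")

    return alerts
```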

All of these components should help your team feel more confident about making changes and keep improving the agent over time.

Build vs buy: AI observability

Now that we’ve walked through the core components for observing your AI agent in production, it’s time to tackle the big question: should you build all of this yourself, or use a platform that’s designed for this purpose?

Why some teams build

Engineering-heavy orgs might already have internal tracing systems or want full control over their observability stack. If your team has the bandwidth to maintain custom tooling and your use case is narrow and well defined, building in-house can work, especially if you want to tightly couple observability with proprietary systems.

But building all of these components requires a lot of infrastructure maintenance work. You'll need to:

  • Handle prompt versioning
  • Log tool calls and retries
  • Build visual trace viewers
  • Integrate user feedback
  • Maintain cost and performance dashboards

Even then, you’re just recreating what existing platforms already offer, without the speed or scale.

Why most teams buy

AI observability is a fast-moving space. Buying gets you the features you'll need out of the box:

  • Full tracing for every model call, tool use, cost and decision
  • Visual execution graphs
  • Integrated feedback loops
  • Prompt / Agent comparisons
  • Alert systems
  • Support for multiple LLM providers

So if you’re experimenting with a small proof of concept, building might be fine. But if you’re running AI in production, especially with complex agents, external tools, or high customer impact, it usually makes more sense to buy.

Try Vellum

At Vellum, we built observability in from the start because teams kept telling us the same thing: it was too hard to understand what their agents were doing. Debugging took too long, issues were hard to track down, and sharing what happened with teammates felt messy.

So we made observability part of the workflow from day one. Whether you're testing a new agent workflow or monitoring another one in production, you can see exactly what happened and why.

Check out the image below to see the level of detail you can get for a given execution:

With observability built into the product, we've helped customers move faster and quickly improve their AI apps in production. For example, Rely Health, a growing care navigation AI company, cut their time to resolution by 100× across all the AI workflows they've deployed via Vellum, which are used by hundreds of care navigators.

"We create dozens of AI workflows; easily 7-8 per client, covering patient phone calls, provider calls, and more. Vellum has been a huge help because we need to rapidly iterate and deploy. We can take a given execution, drop it into our evaluation table, and keep moving fast." - Prithvi, CTO at Rely Health.

Here's a bit more context on how we enable this for Rely Health:

  • Observe as you build: With Vellum Workflows, you can trace every execution even while you're prototyping your agent, which makes it easy to debug before your system ever reaches production.
  • End-to-end Tracing: Vellum tracks the full execution trace for you: inputs, outputs, latency, tool invocations, token usage, costs, and nested sub-agent executions. You don’t need to set anything up manually. Once an execution happens in production, you can easily replay it and analyze the specific steps that the agent took for that specific instance.
  • Visual tracing: Vellum Workflow previews let you see your entire workflow visually. You can re-run a given execution and visually debug what happened at each step of your agent's execution flow, which is especially useful for multi-step agents. (More on this here.)
  • Proactive debugging: Using the Vellum Monitoring dashboard, you can spot issues before users do with dashboards that flag error spikes, latency jumps, hallucinations, and quality drops.
  • Send user feedback to evals: You can capture user feedback and link it directly to your evaluation table. From there you can run evaluations against real-world feedback and executions, and continuously improve your system.

Conclusion

When it comes to AI agents, observability is a requirement. These are probabilistic systems that can appear to work fine while quietly making bad decisions. And once they’re in production, the cost of not knowing why something broke, or worse, not knowing it broke at all, can add up fast.

The bottom line: you don't need to track everything, but you do need to track what matters for your use case. And you need to start before something breaks, not after.

I hope this post helps you get started. You can always ask us for help here: Request AI expert help with Vellum.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.
