
The Six Levels of Agentic Behavior

A look at AI's evolution from basic, rule-based systems to fully creative agentic workflows.


Quick overview

Agentic behavior in AI refers to how autonomous and decision-capable a system is, ranging from simple task automation to fully autonomous agentic systems.

This article explains each level of agentic behavior so you can identify where your AI systems are today and what it takes to evolve them into agents that act, learn, and improve on their own.

Level | Definition | Example
L0: Rule-Based (Follower) | Follows predefined if-this-then-that rules; no reasoning or learning. | A Zapier workflow or scripted bot.
L1: Basic Responder (Executor) | Processes inputs and generates responses, but is purely reactive. | A chatbot that answers from retrieved context.
L2: Use of Tools (Actor) | Decides when to call external tools and folds results into its output. | An assistant that queries an API to answer a question.
L3: Observe, Plan, Act (Operator) | Plans multi-step workflows and evaluates its own outputs before proceeding. | An agent that sequences and checks workflow steps.
L4: Fully Autonomous (Explorer) | Maintains state, triggers actions on its own, and refines execution in real time. | A persistent agent monitoring email, Slack, and DBs.
L5: Fully Creative (Inventor) | Invents its own logic and builds new tools to solve novel problems. | Not yet achieved in practice.

The six levels of agentic behavior

Everyone’s racing to build AI agents, but ask five engineers what that actually means, and you’ll get five different answers. Instead of debating definitions, let’s talk about what really matters—what these systems can actually do.

How much autonomy, reasoning, and adaptability do they have? Where do they hit a wall? And how close are we to agents that can truly operate on their own?

That’s where things get interesting.

At the end of the day, every AI system has some level of autonomy, control, and decision-making. But not all autonomy is the same.

To make sense of this, we put together a six-level framework (L0–L5) that breaks it down. The idea comes from how autonomous vehicles (AVs) define autonomy—not as a sudden jump, but as a gradual, structured progression. Self-driving cars don’t reach L3+ autonomy without first mastering lane assist, adaptive cruise control, and automated parking—each capability building on the last.

AI agents follow the same pattern, with each level adding more complexity, reasoning, and independence.

Below is how we break it down. If you’ve got thoughts, we’d love to hear them—this is one of those topics we could talk about for hours.

L0: Rule-Based Workflow (Follower)

At this level, there’s no intelligence—just if-this-then-that logic. Think of it like an Excel macro:

  • No decision-making—just following predefined rules.
  • No adaptation—any changes require manual updates.
  • No reasoning—it doesn’t "think," just executes.

Examples? Traditional automation systems like Zapier workflows, pipeline schedulers, and scripted bots. Useful, but rigid—they break the moment conditions change.
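
For a concrete picture, here’s a minimal Python sketch of L0 behavior (the event types and actions are hypothetical, purely illustrative):

```python
# A minimal L0 sketch: pure if-this-then-that, no model, no learning.
# The event types and actions here are hypothetical examples.

def handle_event(event: dict) -> str:
    # Every behavior is hardcoded; handling a new case means editing code.
    if event["type"] == "new_email" and "invoice" in event["subject"].lower():
        return "forward_to_accounting"
    if event["type"] == "form_submitted":
        return "add_row_to_spreadsheet"
    return "ignore"

print(handle_event({"type": "new_email", "subject": "Invoice #1234"}))
# -> forward_to_accounting
```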

L1: Basic Responder (Executor)

Now, we start seeing a tiny bit of autonomy.

At this level, AI can process inputs, retrieve relevant data, and generate responses based on patterns. But it still lacks real agency—it doesn’t plan, and it has no memory.

Here’s the key limitation: no control loop. No iterative reasoning, no self-directed decision-making.

It’s purely reactive.
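
In code, the jump from L0 to L1 is essentially swapping hardcoded rules for a model call. Here’s a minimal sketch (llm_complete is a hypothetical stand-in for any chat-completion API):

```python
# A minimal L1 sketch: one stateless request/response pass.

def llm_complete(prompt: str) -> str:
    return "(model response)"  # stub; replace with a real provider call

def respond(user_message: str, retrieved_context: str) -> str:
    # The model interprets the input and retrieved data, but nothing
    # persists afterwards: no memory, no control loop, no follow-up actions.
    prompt = (
        f"Context:\n{retrieved_context}\n\n"
        f"Answer the user's question:\n{user_message}"
    )
    return llm_complete(prompt)

print(respond("What's our refund policy?", "Refunds accepted within 30 days."))
```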

As we move up the levels, you’ll see how small changes—like adding memory, multi-step reasoning, or environment interaction—start unlocking real agency.

L2: Use of Tools (Actor)

At this stage, AI isn’t just responding—it’s executing. It can decide to call external tools, fetch data, and incorporate results into its output. This is where AI stops being a glorified autocomplete and actually does something. This agent can make execution decisions (e.g., “Should I look this up?”).

The system decides when to retrieve data from APIs, query search engines, pull from databases, or reference memory. But the moment AI starts using tools, things get messy. It needs some kind of built-in BS detector—otherwise, it might just confidently hallucinate the wrong info.

Many AI apps today live at this level. It’s a step toward agency, but still fundamentally reactive—only acting when triggered, with some orchestration sugar on top. It also doesn’t have any iterative refinement: if it makes a mistake, it won’t self-correct.
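
Here’s a minimal sketch of that single execution decision (“should I look this up?”). The tool name is hypothetical and the model call is mocked so the example runs end to end:

```python
import json

# A minimal L2 sketch: the model decides whether to call a tool,
# but there's no retry or self-correction if the call goes wrong.

def llm_complete(prompt: str) -> str:
    # Mock model: always asks for the order-lookup tool when deciding.
    if prompt.startswith("Reply with JSON"):
        return '{"tool": "search_orders", "query": "order 1234"}'
    return f"(answer based on: {prompt[:60]}...)"

TOOLS = {
    "search_orders": lambda q: f"Status for {q}: shipped",  # stub tool
}

def agent_answer(user_message: str) -> str:
    # One execution decision: should I look this up?
    decision = json.loads(llm_complete(
        'Reply with JSON {"tool": name-or-null, "query": str} for: '
        + user_message
    ))
    if decision["tool"] in TOOLS:
        result = TOOLS[decision["tool"]](decision["query"])
        # The result is folded into the answer, but a bad tool call
        # or bad retrieval goes uncorrected: no iterative refinement.
        return llm_complete(f"Using this result: {result}\nAnswer: {user_message}")
    return llm_complete(user_message)

print(agent_answer("Where is order 1234?"))
```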

L3: Observe, Plan, Act (Operator)

At L3, AI isn’t just reacting—it’s managing the execution. It maps out steps, evaluates its own outputs, and adjusts before moving forward.

Here’s what changes:

  • Detects state changes – Watches for triggers like DB updates, new emails, or Slack messages.
  • Plans multi-step workflows – Doesn’t just return output; sequences actions based on dependencies.
  • Runs internal evals – Before moving to the next step, it checks if the last one actually worked.
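
Sketched as code, the loop looks something like this (a hypothetical plan-execute-evaluate skeleton, not any specific framework; all helpers are stubs):

```python
# A minimal L3 sketch: plan, execute, evaluate each step before moving on.

def make_plan(goal: str) -> list[str]:
    return [f"step 1 for {goal}", f"step 2 for {goal}"]  # stub planner

def execute(step: str) -> str:
    return f"result of {step}"  # stub tool call / API request

def evaluate(step: str, result: str) -> bool:
    return True  # stub internal eval: did the last step actually work?

def run_workflow(goal: str) -> None:
    for step in make_plan(goal):
        result = execute(step)
        if not evaluate(step, result):
            result = execute(step + " (revised)")  # adjust before moving on
    # Plan exhausted -> the system shuts down. It never sets its own goals.

run_workflow("triage new support tickets")
```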

It’s a big step up from simple tool use, but there’s still a limit—once the task is complete, the system shuts down. It doesn’t set its own goals or operate indefinitely. Even when Sam Altman and his team ship GPT-5, systems built on it will still be stuck at L3—a fancy orchestrator, not a truly autonomous agent.

Right now, these workflows are closer to sophisticated automation than agency.

Powerful? Absolutely. Self-directed? Not quite.

L4: Fully Autonomous (Explorer)

At L4, agents start behaving like stateful systems. Instead of running isolated task loops, they:

  • Maintain state – They stay alive, monitor environments, and persist across sessions.
  • Trigger actions autonomously – No more waiting for explicit prompts; they initiate workflows.
  • Refine execution in real time – They adjust strategies based on feedback, not just static rules.

This starts to feel like an independent system. It can “watch” multiple streams (email, Slack, DBs, APIs), plan actions, and execute without constant human nudging.
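
A rough sketch of that shape: a long-lived loop that stays alive and initiates work on its own (event sources and handlers are hypothetical stubs; a real system needs durable state, queues, and error handling):

```python
import time

# A rough L4 sketch: a persistent loop that watches multiple streams
# and triggers workflows without being prompted.

state = {"seen": set()}  # persists across iterations (ideally across restarts)

def poll_sources() -> list[dict]:
    return []  # stub: check email, Slack, DBs, APIs for new events

def plan_and_execute(event: dict) -> None:
    pass  # stub: an L3-style plan/act/evaluate run, self-initiated

while True:
    for event in poll_sources():
        if event["id"] not in state["seen"]:
            state["seen"].add(event["id"])
            plan_and_execute(event)  # no human nudge required
    time.sleep(30)  # keep watching; the agent never "completes" and exits
```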

But we’re still in the early days.

Most L4 agentic workflows today don’t reliably persist across sessions, adapt dynamically, or iterate beyond predefined loops. The key word here is "reliably." There are some solutions—but do they actually work well? Debatable.

L5: Fully Creative (Inventor)

At this stage, AI isn’t just running predefined tasks—it’s creating its own logic, building tools on the fly, and dynamically composing functions to solve problems we don’t yet have answers to. It’s not just following instructions; it’s designing its own utilities from scratch based on the task at hand.

We’re nowhere near this yet.

Today’s models are still overfitting—they’re good at regurgitating, bad at real reasoning.

Even the most powerful models (e.g., o1, o3, DeepSeek-R1) still overfit and follow hardcoded heuristics.

But this is the goal: AI that doesn’t just follow instructions but figures out new ways to improve, create, and solve problems in novel ways.

Where are we now?

Here at Vellum, we’ve worked with companies like Redfin, Drata, and Headspace—all deploying real-world AI applications. And here’s what we’re seeing:

Most AI systems today sit at L1.

The focus is on orchestration—optimizing how models interact with the rest of the system, tweaking prompts, improving retrieval and evals, and experimenting with different modalities. Systems at this level are also easier to manage and control in production: debugging is more tractable, and failure modes are fairly predictable.

L2 is where most of the action is happening right now.

Models like o1, o3-mini, and DeepSeek-R1 are paving the way for more intelligent multi-stage workflows. We’re also seeing some really cool new products and UI experiences pop up as a result.

Most enterprises don’t touch L2—for now, it’s almost entirely startups pushing this space. There’s a reason most production AI workflows are still human-in-the-loop—LLMs don’t handle edge cases well, and debugging an agent that went off the rails three steps ago is a nightmare.

L3 and L4 are still limited.

The tech just isn’t there yet—both at the model level (LLMs cling to their training data like a security blanket) and at the infrastructure level, where we’re missing key primitives for real autonomy.

Current limits

Even the most powerful models still overfit like crazy.

Last week, we ran an eval using well-known puzzles—ones these models have definitely seen in training. Then we tweaked them slightly. The results? The models couldn’t adapt and just regurgitated the solutions they learned, even when they didn’t fit the new version of the problem.

Take DeepSeek-R1—trained primarily with reinforcement learning rather than supervised fine-tuning. You’d think it would generalize better, right? Nope. Still overfits. Feels like we’re staring at a local maximum with these models.

And here’s the problem: truly autonomous agentic workflows depend on models that can actually reason, not just remix training data. Right now, we’re nowhere close.

So yeah, we’ll see incremental improvements. But a real leap to L3 or L4?

Not a sure thing. It might take a fundamental breakthrough (looking at you, Ilya Sutskever)—or we might just be stuck here for a while.

Move up the stack with Vellum

If AI agents are going to move up the stack, teams need better ways to test, evaluate, and refine their workflows.

That’s where Vellum comes in.

Right now, most AI development relies on trial-and-error—tweaking prompts, adjusting logic, and hoping for the best. But as your workflows become more complex (especially at L2+), debugging becomes a nightmare. One wrong tool call, one bad retrieval step, and everything breaks three layers deep.

Vellum provides strong unit and end-to-end workflow testing to make iteration faster and more effective. Whether you’re refining agent logic or testing edge cases, a flexible testing framework helps you reliably move up the L0–L5 stack.
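
As a generic illustration (not Vellum’s actual API), workflow-level testing means pinning down both individual steps and the end-to-end result, so a regression three layers deep fails a test instead of production:

```python
# A generic sketch of unit + end-to-end workflow checks (illustrative only).
# Run with pytest; run_agent_workflow is a stand-in for a deployed workflow.

def run_agent_workflow(user_input: str) -> dict:
    # Stand-in that returns a step-by-step execution trace.
    return {"tool_calls": ["search_orders"], "answer": "Order 1234 shipped."}

def test_order_status_uses_lookup_tool():
    trace = run_agent_workflow("Where is order 1234?")
    assert "search_orders" in trace["tool_calls"]  # unit-level: right tool
    assert "shipped" in trace["answer"]            # end-to-end: right answer
```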

Book a call to chat with one of our AI experts and see how Vellum can help you move up the stack.

FAQs

1) What does “agentic behavior” mean in AI?

Agentic behavior describes how independently an AI system can reason, decide, and act toward goals without human input. With Vellum, teams can prototype and build AI agents by prompting Vellum with natural language or by manually dragging and dropping nodes in the Workflow sandbox. These agents range from simple task automations to production-grade agentic systems.

2) Why does understanding levels of agentic behavior matter?

Understanding these levels helps organizations assess their AI maturity and plan safe, scalable upgrades. Vellum makes this easier by letting you visualize and orchestrate agent workflows using shared components and version-controlled logic.

3) How is agentic behavior different from automation?

Automation follows fixed rules, while agentic systems adapt based on context and goals. Vellum combines both through reusable components that support deterministic automation and adaptive decision-making in one collaborative environment.

4) Can AI agents make mistakes when acting autonomously?

Yes, which is why versioning, testing, and evaluations are critical. Vellum includes an Evaluations sandbox, human-in-the-loop review, mock inputs/tools, and transparent execution logs in the Workflow Console that help teams validate agent outcomes before and after deployment.

5) What technologies enable higher levels of agentic behavior?

Language models, retrieval systems, and orchestration frameworks are key enablers. Vellum provides an integrated platform to experiment with memory, context passing, and reasoning strategies without needing to rebuild infrastructure.

6) Where does Vellum fit in this framework?

Vellum serves as the foundation for building, testing, and managing agentic systems collaboratively. It reduces engineering overhead by giving teams a visual, centralized space to design and control agent behavior.

7) How do multi-agent systems work?

Multi-agent systems use specialized agents that share context and work together toward a common objective. Vellum supports these setups through collaborative workflows and sandboxed environments that make coordination and monitoring easier.

8) What are the risks of advancing too quickly toward full autonomy?

Overextending autonomy can lead to unpredictability or performance drift. Vellum helps manage this by tracking every version and run so teams can roll back changes, compare results, and maintain safe levels of control.

9) How can I measure my AI’s level of agentic behavior?

Evaluate autonomy, goal tracking, and adaptability. Within Vellum, these traits can be observed directly through execution traces and evaluation sets, making it easy to benchmark progress over time.

10) How does agentic behavior relate to AI governance and safety?

Higher autonomy requires versioning and evaluations paired with governance features. Vellum supports governance with audit trails, permission controls, and reproducible run histories that keep agent behavior transparent and trackable.

