What is Required for a Reliable AI System?

Learn the key strategies and tools for building production-ready AI systems.

Having worked with thousands of companies building AI systems in production, we’ve identified some common patterns among the companies that manage to take their flashy demos and proofs of concept into production and get ongoing business value from them.

Today our platform powers more than 20M API requests in production per month and enables the development of AI products for iconic brands like Redfin (read more about Ask Redfin here).

This guide is meant for product & engineering leaders who are trying to make sense of the AI development process and the necessary tooling to effectively integrate AI.

Whether these tools are built in-house or purchased from a vendor, we see 4 distinct categories emerge:

  1. Experimentation: Pick the best architecture and prompts for your task across all open and closed source models.
  2. Evaluation: Make concrete progress in LLM app development via pre-defined qualitative or quantitative evaluation metrics.
  3. Lifecycle Management: Ensure your application continues to perform well in production and make changes with confidence.
  4. Security & Collaboration: Have multiple people in your company work on your AI features with the right data security in place.

In the next sections, we'll give you easy-to-follow advice on the tools and methods you need for each step.

Experimentation

LLMs are non-deterministic and unpredictable, and getting them right requires a lot of trial and error. A framework that allows rapid experimentation across various approaches is essential to building the best-quality product.

Here are some items to consider while you’re experimenting:

  1. Prompt Engineering: Quickly iterate on various prompting techniques like zero-shot, few-shot, chain-of-thought, or tree-of-thought prompting to see which one works best for your task at hand. Tweak LLM parameters like top_p, top_k, temperature, stop_sequence, and frequency & presence penalties to control how the LLM behaves and generates responses.
  2. Multi-step architectures: Often, one prompt is not enough to consistently achieve the task at hand. For instance, for a support chatbot, a common pattern we see is an LLM-powered intent classification step up front, followed by specialized downstream prompts that handle each intent. Make sure your experimentation isn’t limited to a single prompt!
  3. RAG pipelines: Retrieval Augmented Generation (or RAG) is a common architecture choice when your application needs to refer to context from a knowledge base. Your retrieval results can vary based on choice of embedding model, filtering, chunking & retrieval strategies so remember to test across these options.
  4. Model provider agnostic: Every month we see new foundation models being announced by providers like OpenAI, Anthropic, Mistral, Google & Meta. Our leaderboard shows that Anthropic and Google’s latest models currently outperform GPT-4 on general benchmarks. Your experimentation framework should be model provider agnostic so you can pick the best model for the task at hand.
  5. Version control your experimentation: Experimentation is more science than art. Keep track of every iteration you try so you can pick the best elements from each attempt as you make improvements.
  6. (Bonus) Bring people beyond software engineers into prompt engineering: Prompts are written in natural language, so you will save valuable software engineering time if your framework allows non-technical team members to experiment too.
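
To make the multi-step pattern from item 2 concrete, here is a minimal sketch of an intent classifier that routes to specialized downstream prompts. The `call_llm` callable, the intent labels, and the prompt wording are all hypothetical placeholders rather than any provider's API; swap in your own client.

```python
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text.
LLM = Callable[[str], str]

# Illustrative intents and downstream prompts (assumptions, not from a real product).
INTENT_PROMPTS = {
    "billing": "You are a billing specialist. Answer the customer:\n{message}",
    "technical": "You are a support engineer. Help debug the issue:\n{message}",
    "other": "You are a helpful support agent. Respond to:\n{message}",
}

CLASSIFIER_PROMPT = (
    "Classify the customer message into exactly one of: billing, technical, other.\n"
    "Reply with the label only.\n\nMessage: {message}"
)


def route(message: str, call_llm: LLM) -> tuple[str, str]:
    """Step 1: classify the intent. Step 2: run the prompt specialized for it."""
    raw_label = call_llm(CLASSIFIER_PROMPT.format(message=message))
    label = raw_label.strip().lower()
    if label not in INTENT_PROMPTS:
        label = "other"  # Fall back when the classifier returns something unexpected.
    answer = call_llm(INTENT_PROMPTS[label].format(message=message))
    return label, answer


def fake_llm(prompt: str) -> str:
    """Stand-in model for local testing; replace with a real provider call."""
    if prompt.startswith("Classify"):
        return "Billing" if "invoice" in prompt.lower() else "unsure"
    return f"[answer based on prompt: {prompt.splitlines()[0]}]"
```

Keeping the classifier and the downstream prompts as separate pieces means each one can be tested and versioned on its own.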

Given the non-deterministic nature of Large Language Models, we always recommend using a test-driven development framework while experimenting with LLMs.
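
As a rough illustration of what test-driven experimentation can look like, the sketch below scores two prompt variants against the same small test bank. The test cases, prompt wording, and `fake_llm` stub are invented for the example; in practice `call_llm` would wrap your model provider.

```python
from typing import Callable

# Hypothetical test bank: template variables plus the expected label for each case.
TEST_CASES = [
    {"vars": {"review": "Loved it, would buy again"}, "expected": "positive"},
    {"vars": {"review": "Broke after one day"}, "expected": "negative"},
    {"vars": {"review": "Terrible support, great product"}, "expected": "mixed"},
]

# Two prompt variants under comparison (illustrative wording).
PROMPT_VARIANTS = {
    "zero_shot": "Classify the sentiment of this review: {review}",
    "with_labels": "Classify as positive, negative, or mixed: {review}",
}


def pass_rate(template: str, call_llm: Callable[[str], str]) -> float:
    """Run every test case through one prompt variant; return the fraction that pass."""
    passed = 0
    for case in TEST_CASES:
        output = call_llm(template.format(**case["vars"]))
        if output.strip().lower() == case["expected"]:
            passed += 1
    return passed / len(TEST_CASES)


def compare_variants(call_llm: Callable[[str], str]) -> dict[str, float]:
    """Score every variant so iterations are compared on the same test bank."""
    return {name: pass_rate(tpl, call_llm) for name, tpl in PROMPT_VARIANTS.items()}


def fake_llm(prompt: str) -> str:
    """Stand-in model: it only answers 'mixed' when the prompt lists the labels."""
    text = prompt.lower()
    if "terrible" in text and "great" in text:
        return "mixed" if "or mixed" in text else "negative"
    return "positive" if "loved" in text else "negative"
```

Storing the scores next to each prompt version gives you the iteration history described in item 5 above.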

After completing this step, you will have a clear understanding of which projects are:

  1. Realistic and achievable
  2. Likely to have the greatest positive effect
  3. The most favorable to develop based on current circumstances and resources.

Evaluation

Knowing the goalpost for your experimentation is key, and that’s where Evaluation comes in.

At this stage, you’re done with your experiments, and you want to optimize for quality, cost, latency & privacy considerations while building the application.

Here’s what to have in mind while in this phase:

  1. Cost: Most models charge per token and pricing is available online directly with the model providers. Remember to keep context window and response length in check if you’re looking to optimize cost.
  2. Latency: Different models have different latency; we published a report about this a few months ago: latency comparison. Since then, Claude 3 Haiku has been released, and it offers very low latency for its quality. Pick the right set of models for your task, and consider time to first token as your key metric if streaming is acceptable for your end user experience.
  3. Privacy: Some models may be entirely “out-of-scope” for you if you don’t have the right legal provisions set up with the model providers or if the data must live on-premise.
  4. Quality: Ultimately, quality is usually the most important dimension because developers don’t put apps in production if they don’t meet the quality threshold. The rest of this section covers successful strategies to measure the quality of your LLM output.
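
The cost and latency points above come down to simple arithmetic and timing. The sketch below assumes input and output tokens are priced separately per 1,000 tokens (check your provider's pricing page for the real scheme and numbers), and it measures time to first token on any iterable of streamed chunks. `fake_stream` is a stand-in, not a provider API.

```python
import time
from typing import Iterable, Iterator


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Per-request cost when input and output tokens are priced separately."""
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)


def time_to_first_token(stream: Iterable[str]) -> tuple[float, float, str]:
    """Consume a stream; return (seconds to first chunk, total seconds, full text)."""
    start = time.perf_counter()
    first = None
    chunks = []
    for chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return (first if first is not None else total), total, "".join(chunks)


def fake_stream() -> Iterator[str]:
    """Stand-in for a provider's streaming response."""
    for token in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield token
```

With hypothetical prices of 0.5 and 1.5 per 1,000 input and output tokens, `estimate_cost(2000, 500, 0.5, 1.5)` comes to 1.75, which shows how quickly a long context window dominates the bill.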

After you’re done optimizing, you’re ready to test whether your AI system is going to work reliably against your test cases.

Using ground truth data

First check if you have "ground truth" data for your use-case. This is the answer you'd expect the LLM to provide. Usually this data can come from either manually labeling test cases or looking at historical data you have access to.

For example, let’s say you want to automatically pull out information from PDF files. You might already have some examples of the data you need, which were previously entered by your operations team. You can use these examples as labeled data to test your automation system against.

When you have labeled data, you can compare the model-generated response with the target response. Consider these metrics:

  1. Exact match: Useful for classification tasks. Does the model-generated response exactly match the target response?
  2. Regex match: Also useful for classification tasks. Does the model-generated response match the regex pattern in the target response?
  3. Semantic similarity: Useful for Q&A or generative responses where there is a correct answer. How semantically similar is the model-generated response to the target response?
  4. JSON key-value match: Useful for data extraction tasks. Do the key-value pairs match the target response?
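
The four metrics above can be sketched in a few lines each. This is one possible implementation, not a standard one: the normalization in `exact_match` is a choice you may not want, and `cosine_similarity` assumes you have already turned both responses into embedding vectors with a model of your choice.

```python
import json
import math
import re


def exact_match(response: str, target: str) -> bool:
    """Classification-style check, ignoring case and surrounding whitespace."""
    return response.strip().lower() == target.strip().lower()


def regex_match(response: str, pattern: str) -> bool:
    """True when the target regex pattern is found anywhere in the response."""
    return re.search(pattern, response) is not None


def json_key_value_match(response: str, target: dict) -> float:
    """Fraction of target key-value pairs reproduced in a JSON response."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or not target:
        return 0.0
    hits = sum(1 for key, value in target.items()
               if key in parsed and parsed[key] == value)
    return hits / len(target)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embedding vectors (embeddings computed elsewhere)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Returning a fraction from `json_key_value_match` rather than a boolean makes partial extractions visible, which tends to be more useful when comparing prompts.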

No ground truth data, no problem

If you don't have labeled data, there are some other strategies to evaluate model responses:

  1. LLM based eval: Have a downstream prompt or set of prompts evaluate the model response. This can be very custom; choose whatever evaluation prompts you believe are helpful for your task.
  2. Code eval: Use code to do your evaluation. Is the response less than 100 characters? Is the response valid HTML? Just like LLM based eval, this can also be very custom.

Both LLM based eval and code eval can also be used to evaluate RAG pipelines.
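
Both styles can be sketched briefly. For code eval, the example below checks the length limit and uses the standard library's `HTMLParser` to confirm that tags are balanced. Balanced tags are only a rough proxy for "valid HTML" and are stricter than the HTML spec, which allows some closing tags to be omitted. For LLM based eval, `call_llm` and the judge prompt wording are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TagBalanceChecker(HTMLParser):
    """Tracks open tags; a well-formedness proxy, not a full HTML validator."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_short(response: str, limit: int = 100) -> bool:
    """Code eval: is the response under the character limit?"""
    return len(response) < limit


def has_balanced_html(response: str) -> bool:
    """Code eval: does every opened tag close in the right order?"""
    checker = _TagBalanceChecker()
    checker.feed(response)
    checker.close()
    return checker.balanced and not checker.stack


def llm_judge(question: str, response: str, call_llm: Callable[[str], str]) -> bool:
    """LLM based eval: ask a second prompt for a YES/NO verdict on the response."""
    prompt = (
        "Does the answer fully address the question? Reply YES or NO.\n\n"
        f"Question: {question}\nAnswer: {response}"
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```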

💡 Now we won't end this section without some bonus metrics 💡

  1. JSON specific metrics: Useful for data extraction tasks. Is this valid JSON? Does the schema match target schema?
  2. Evaluation post execution: Useful for SQL or code generation tasks. Run the completed generation and see if the response was correct or not.
  3. Human evaluation: When domain specific expertise (e.g., legal, healthcare) is needed to evaluate the response quality, having human evaluators grade the output would be your best bet.
  4. RAG evaluation: There are metrics to evaluate the quality of your retrieval and generation. RAGAS provides a helpful starting point for choosing the metrics that matter for your use case.
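
The first two bonus metrics lend themselves to short code checks. In this sketch the "schema" is just a mapping from key names to Python types, which is a simplification of a real JSON Schema, and the SQL check runs both the generated and a reference query against a throwaway in-memory SQLite database. The table and queries you would pass in are your own; never run generated SQL against a database you care about.

```python
import json
import sqlite3


def is_valid_json(response: str) -> bool:
    """Does the response parse as JSON at all?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def matches_schema(response: str, schema: dict[str, type]) -> bool:
    """True when the response is a JSON object with exactly the expected keys and types."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict) or set(parsed) != set(schema):
        return False
    return all(isinstance(parsed[key], expected) for key, expected in schema.items())


def sql_results_match(generated_sql: str, reference_sql: str, setup_sql: str) -> bool:
    """Run both queries on an in-memory database and compare rows, ignoring order."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        expected = sorted(conn.execute(reference_sql).fetchall(), key=repr)
        try:
            actual = sorted(conn.execute(generated_sql).fetchall(), key=repr)
        except sqlite3.Error:
            return False  # The generated query failed to execute.
        return actual == expected
    finally:
        conn.close()
```

Comparing result rows rather than SQL text lets two differently written but equivalent queries both count as correct.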

While evaluating your prompt or multi-step chain, we always recommend coming up with a basket of metrics based on the task at hand and testing across models and prompts until you meet your quality criteria.

If you have a high bar for accuracy and low risk tolerance in case something goes wrong, make sure to have a large number of test cases in your test bank.

The tooling you use for experimentation should be flexible enough to allow you to compose metrics of your choice.

Lifecycle Management

Once your AI application is in production, you’ll inevitably need to make changes. A new model might come out, or your system may encounter edge cases. Safely making changes once in production is critical, and companies should have appropriate tooling for it.

Measuring performance in production

Here are the common actions performed by the most successful companies we work with:

  • Log all the calls you make to the LLM provider: inputs, outputs, exact provider payload, latency.
  • If your application uses chained prompts, track the inputs, outputs and latency at each step for full traceability.
  • This raw data is used in a visualization tool for better observability: number of tokens over time, latency over time, errors over time etc. Use your creativity and make charts to track whatever trends are most important to you.
  • Set up alerts. If latency exceeds a set limit, your system should alert you before your users notice.
  • Run the metrics from your unit tests (e.g., relevance, helpfulness, bias) on production traffic to measure the quality of the application in production.
  • Any time you encounter an edge case in production, add it to your unit test bank so your next release is even higher quality.
  • If possible, capture implicit or explicit user feedback for each completion. Explicit user feedback is collected when your users respond with something like a 👍 or 👎 in your UI when interacting with the LLM output. Implicit feedback is based on how users react to the output generated by the LLM. For example, if you generate a first draft of an email for a user and they send it without making edits, that’s likely a good response.
  • Support stateful API calls. While building more advanced systems like agents, you’d benefit if you correctly maintain state between API calls. By retaining state across multiple calls, the application can efficiently manage user context, adapt its behavior based on past interactions, and provide timely updates or transactional operations. Custom memory management strategies are used when needs are more nuanced.
  • Caching, retry logic, fallback logic. OpenAI may be down, or your application might hit rate limits, so make sure there’s a backup and your end user experience isn’t affected. Cache responses so you can save tokens.
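
Several of the practices above (logging every call, caching, retries, and a fallback) can live in one thin wrapper. The sketch below is a minimal version under stated assumptions: `primary` and `fallback` are hypothetical callables wrapping two providers, the cache is an in-memory exact-match dictionary, and the broad `except Exception` should be narrowed to your client's rate-limit and timeout errors.

```python
import time
from typing import Callable

LLM = Callable[[str], str]


class ResilientClient:
    """Wraps a primary and a fallback model with caching, retries, and call logging."""

    def __init__(self, primary: LLM, fallback: LLM,
                 max_retries: int = 2, backoff_s: float = 0.0) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_retries = max_retries
        self.backoff_s = backoff_s
        self.cache: dict[str, str] = {}
        self.log: list[dict] = []

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:  # Serve repeats from cache to save tokens.
            self._record(prompt, self.cache[prompt], "cache", 0.0)
            return self.cache[prompt]
        start = time.perf_counter()
        for attempt in range(self.max_retries + 1):
            try:
                output = self.primary(prompt)
                source = "primary"
                break
            except Exception:  # Narrow this to your provider's error types.
                time.sleep(self.backoff_s * (2 ** attempt))  # Exponential backoff.
        else:
            # Every attempt on the primary failed, so use the backup model.
            output = self.fallback(prompt)
            source = "fallback"
        self.cache[prompt] = output
        self._record(prompt, output, source, time.perf_counter() - start)
        return output

    def _record(self, prompt: str, output: str, source: str, latency_s: float) -> None:
        """Append one log entry per call: input, output, where it came from, latency."""
        self.log.append({"input": prompt, "output": output,
                         "source": source, "latency_s": latency_s})
```

Exact-match caching only pays off when identical prompts recur and a repeated answer is acceptable; the `log` list is a placeholder for whatever observability store you chart from.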

Making changes in production

While making changes to your AI application (either single prompt or multi-prompt):

  • Maintain good version control and version history. Pin to stable versions in production and use staging environments for testing where possible. Maintain the ability to quickly revert to an old version.
  • Replay historical requests with your new prompt / prompt chain and make sure nothing breaks. Regression testing is vital to give you peace of mind that your AI application won’t degrade.
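
Replaying historical requests can be as simple as a loop over your logs. In this sketch the log record shape (`inputs`, `output`) and the `new_pipeline` and `passes` callables are assumptions; `passes` could be an exact comparison, as in `unchanged` below, or any of your evaluation metrics.

```python
from typing import Callable


def replay(history: list[dict],
           new_pipeline: Callable[[dict], str],
           passes: Callable[[str, dict], bool]) -> list[dict]:
    """Re-run logged requests through the new prompt or chain; return the regressions."""
    regressions = []
    for record in history:
        new_output = new_pipeline(record["inputs"])
        if not passes(new_output, record):
            regressions.append({
                "inputs": record["inputs"],
                "old": record["output"],
                "new": new_output,
            })
    return regressions


def unchanged(new_output: str, record: dict) -> bool:
    """Strictest possible check: the new output must equal the logged one."""
    return new_output == record["output"]
```

An empty list from `replay` is a reasonable gate before promoting a new version from staging to production.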

If software never had to change, things would be easy! That’s rarely the case.

Good tooling for Lifecycle Management is necessary as you iterate, evolve, and make changes. Get the basics right and you’ll sleep more easily.

Security and Collaboration

All these prompts, test cases and production traffic need to be stored in a secure environment. They often contain trade secrets and sensitive customer data. In the development process we also see companies wanting collaboration between technical and non-technical stakeholders (e.g., subject matter experts). Engineers often determine the testing methodology and decide when something is ready for production, while subject matter experts can help with prompt engineering & evaluation to share the development load.

Here are some items to consider:

  1. Audit logs: Keep track of who made what changes to your prompts both during experimentation & in production. This will come in handy if there’s a need to investigate an incident down the road.
  2. Virtual Private Cloud environment: If using a SaaS vendor, consider a VPC installation for higher security. By using a VPC, you can ensure that your application and data are protected from unauthorized access, as it allows you to define and manage your own network configurations, including subnets, routing tables, and access control policies. Additionally, a VPC can improve performance and reliability by enabling you to deploy resources in a dedicated, private environment, reducing the risk of interference from other tenants.
  3. Role Based Access Control: You may not want your non-technical stakeholders to update production deployments, and that’s where Role Based Access Control comes in. You can ensure that users have access only to the prompts, test suites and deployments necessary for their role, enhancing both security and operational efficiency.
  4. Multiplayer mode: Building LLM features is a collaborative process. You want people to leave comments on each other’s work, modify each other’s prompts for faster decision making and a more cohesive development process. This makes the team more productive and helps avoid changes being overwritten.
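
As a small illustration of how audit logs and Role Based Access Control fit together, the sketch below checks each action against a role's permissions and records every attempt, allowed or not. The role names, permission names, and `PromptWorkspace` class are hypothetical; a real system would back this with your identity provider and durable storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical roles and permissions; adapt them to your own organization.
ROLE_PERMISSIONS = {
    "subject_matter_expert": {"edit_prompt", "run_tests"},
    "engineer": {"edit_prompt", "run_tests", "deploy_to_production"},
}


@dataclass
class PromptWorkspace:
    """Checks permissions and keeps an audit trail of who tried to do what."""

    audit_log: list[dict] = field(default_factory=list)

    def perform(self, user: str, role: str, action: str, target: str) -> bool:
        allowed = action in ROLE_PERMISSIONS.get(role, set())
        # Record denied attempts too; they matter when investigating an incident.
        self.audit_log.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "action": action,
            "target": target,
            "allowed": allowed,
        })
        return allowed
```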

Need help getting started?

All this may sound daunting, but luckily, you don’t have to build it all yourself.

Vellum is a production-grade AI development platform that gives you the tools and best practices needed without needing to build complex internal tooling.

Reach out to me at akash@vellum.ai or book a demo if you’d like to learn more.

ABOUT THE AUTHOR
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop and evaluate LLM-powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape and is actively sharing what he has learned with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley Office.

LAST UPDATED
Jun 4, 2024