
The Four Pillars of Building LLM Applications for Production

Learn how successful companies develop reliable AI products by following a proven approach.


We’ve learnt a ton about best practices for LLM development since we first launched close to a year ago. We’ve spoken to over 1,500 people at varying stages of maturity in using LLMs in production, and in this post we want to distill some of those learnings for you.

It’s easy to whip together a prototype of an AI-powered feature using popular open source frameworks, but we repeatedly see people having difficulty crossing the prototype-to-production chasm. They deploy to production and then quickly run into countless edge cases and wonder why their AI application isn’t working well.

A common pattern has emerged among successful companies: they’ve invested in four key areas of their AI development process.

Data

LLMs are trained on the general internet and don’t have access to your proprietary data.

LLMs are also inherently stateless.

They’re only as good as the context you provide to them at runtime. Companies that try to use AI to create personalized experiences without providing the AI with personalized data struggle.

Those that invest in providing data unique to their company, their industry, or their customers fare much better.

Here are some questions to keep in mind while thinking about the data you pass into your LLM:

  • What data is unique to you, your company, industry, or customers?
  • How can you best structure and query for this data at runtime, such that it can be included in a prompt that’s sent to an LLM?
  • What prompt/model combination best wraps that data to produce a meaningful output?

Retrieval-Augmented Generation (RAG) via embedding models and vector DBs is a common architecture, best suited for unstructured text. If you’ve built a RAG system before, you already know the challenges:

Which vector DB should I use? OpenAI just released a new embedding model – is it better for my task? Now I need to build a document ingestion pipeline? What are my reranking options?

Keep in mind though that you likely already have a bunch of great structured data and can lean on good ol’ SQL queries as a starting point.
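To make the retrieval step concrete, here’s a minimal sketch of the RAG pattern. The `embed` function below is a toy stand-in (a hashed bag-of-words), not a real embedding model; in production you’d swap in an embedding API and a vector DB:

```python
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word into a
    # fixed-size bag-of-words vector. In production, call an embedding
    # API and store the vectors in a vector DB instead.
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_prompt(query: str, documents: list[str], k: int = 3) -> str:
    # Retrieve the k most similar documents and inline them as context.
    query_vec = embed(query)
    top = sorted(documents,
                 key=lambda d: cosine_similarity(embed(d), query_vec),
                 reverse=True)[:k]
    context = "\n\n".join(top)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

The same pattern works for structured data: replace the retrieval step with a SQL query and inline its results as context.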

Experimentation

Unlike traditional software engineering, LLMs are non-deterministic, unpredictable, and getting them right requires a lot of trial and error.

While trying to pick the right approach for a task, we see people making trade-offs between quality, cost, latency, and privacy requirements.

Here’s how we see customers thinking about experimentation:

  • First, what’s the right architecture for my application? A single prompt + single completion, chat, or a multi-step chain with custom business logic tying the prompts together? RAG? OpenAI, Anthropic, or Mistral? One big prompt for GPT-4, or multiple specialized GPT-3.5 prompts?
  • Questions regarding experimentation are best answered when you have a clear evaluation process and metrics. The eval process depends on the task at hand and whether you have access to “ground truth” data or “expected outputs”. Exact match or regex match can work well for classification tasks; semantic similarity is better for Q&A or generative responses. LLM-based and code-based eval metrics (e.g., is the output valid JSON?) may not require ground truth and can be useful additions to your testing process (see the sketch after this list).
  • Next, think about how many test cases you want to test against. A high bar for accuracy and a low risk tolerance in case something goes wrong? Make sure to have a large number of test cases in your test bank.
  • Finally, think about who will do the testing and experimentation. Does it have to be your engineering team? Often, non-technical people who are closer to customers are a better fit to modify prompts until they clear test cases. Find a way to get this experimentation out of your code to make the whole team move faster.
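Here’s a minimal sketch of what code-based eval metrics can look like; the function names are illustrative, not from any specific eval library:

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    # Good for classification: the output must equal the expected label.
    return output.strip().lower() == expected.strip().lower()

def regex_match(output: str, pattern: str) -> bool:
    # Good when any output matching a pattern is acceptable.
    return re.search(pattern, output) is not None

def is_valid_json(output: str) -> bool:
    # A ground-truth-free, code-based check: does the output parse at all?
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# A tiny test bank for illustration; real ones should be much larger.
test_cases = [
    ("positive", lambda out: exact_match(out, "positive")),
    ('{"sentiment": "positive"}', is_valid_json),
    ("Order #12345 has shipped", lambda out: regex_match(out, r"#\d+")),
]
passed = sum(check(output) for output, check in test_cases)
print(f"{passed}/{len(test_cases)} checks passed")
```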

Experimentation is potentially the most important pillar while building your AI application. Effective experimentation and a good testing framework gives you the confidence needed to bring your AI prototype to production.

Those companies that don’t invest in experimentation frameworks waste time, get frustrated, and ultimately land on less-than-ideal solutions.

Lifecycle Management

After wrangling your data, feeding it into a prompt, and performing your experimentation, you’re ready to ship your AI application to production – congratulations!

However, you’ll inevitably need to make changes…

Software engineering as a profession has existed for decades and with it, a whole industry of tooling for lifecycle management has matured. We have GitHub for version control, Datadog for monitoring, CircleCI for CI/CD, the list goes on... But the age of robust AI tooling is only just beginning.

Regardless of whether you buy a solution or build your own, safely making changes once in production is critical.

Here are the common actions performed by the most successful companies we work with:

  • Log all the calls you make to the LLM provider: inputs, outputs, exact provider payload, and latency (a minimal sketch follows this list).
  • If your application uses chained prompts, track the inputs, outputs and latency at each step for full traceability.
  • If you’re a chart junkie like us, throw this raw data into a visualization tool for better observability: number of tokens over time, latency over time, errors over time, etc. Use your creativity and make charts to track whatever trends are most important to you.
  • Set up alerts. If latency exceeds a set limit, your system should alert you rather than the user.
  • If possible, capture implicit or explicit user feedback for each completion. Explicit user feedback is collected when your users respond with something like a 👍 or 👎 in your UI when interacting with the LLM output. Implicit feedback is based on how users react to the output generated by the LLM. For example, if you generate a first draft of an email for a user and they send it without making edits, that’s likely a good response! Measuring quality over time becomes a lot easier if you keep track of user feedback.
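As a minimal sketch of the first two items, here’s a wrapper that records the exact payload, output, and latency of every call. The `call_llm` argument is a hypothetical stand-in for whatever provider client you use; a chained application would wrap each step this way for full traceability:

```python
import logging
import time

logger = logging.getLogger("llm")

def logged_completion(call_llm, prompt: str, **params):
    # call_llm is a hypothetical stand-in for your provider client.
    payload = {"prompt": prompt, **params}
    start = time.perf_counter()
    try:
        output = call_llm(**payload)
    except Exception:
        logger.exception("LLM call failed: payload=%r", payload)
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    # Record the exact payload, the output, and the latency.
    logger.info("payload=%r output=%r latency_ms=%.0f",
                payload, output, latency_ms)
    return output
```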

While making changes to your AI application (either single prompt or multi-prompt):

  • Maintain good version control and version history. Pin to stable versions in production and use staging environments for testing where possible. Maintain the ability to quickly revert back to an old version.
  • Replay historical requests with your new prompt / prompt chain and make sure nothing breaks (sketched below). Regression testing is vital to give you peace of mind that your AI application won’t degrade.
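Once logging and evals are in place, the replay loop itself can be small. This sketch assumes hypothetical `run_prompt`, logged request records, and check functions like the eval metrics above; none of it is a specific library’s API:

```python
def regression_test(run_prompt, logged_requests, checks):
    # run_prompt executes the *new* prompt version; logged_requests are
    # historical inputs captured by your logging layer; checks are the
    # same eval functions you used during experimentation.
    failures = []
    for request in logged_requests:
        output = run_prompt(request["inputs"])
        for name, check in checks.items():
            if not check(output, request):
                failures.append((request["id"], name))
    return failures  # an empty list means it's safe to promote the version
```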

If software never had to change, things would be easy!

That’s rarely the case.

Good tooling for Lifecycle Management is necessary as you iterate, evolve, and make changes. Get the basics right and you’ll sleep more easily.

Continuous Improvement

You’ve instrumented monitoring and logging, and maybe you’re even capturing user feedback. This data is incredibly valuable because you can create a feedback loop back to Pillar 1 – Data – and continuously improve the quality of your application over time.

Embed the right feedback loops and your competitive data moat deepens automatically.

  • Any time you find an edge case where the model didn’t perform well, add it to your bank of test cases. Make sure the application clears that edge case the next time you make a change.
  • If you apply advanced techniques like dynamic few-shot prompting via RAG, good completions can be added to your vector DB to provide better context the next time the model sees a similar request.
  • Build a caching layer if you’d like to save cost or lower latency. Each time your AI application sees an input similar to one it has seen before, you can serve the response from cache instead of calling the LLM again (a minimal sketch follows this list).
  • Once you have a large enough bank of good-quality completions, use that data to fine-tune a custom model (either OpenAI or open source) to further reduce cost and latency and increase accuracy.
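As one example, a semantic cache can be sketched in a few lines, reusing the `embed` and `cosine_similarity` helpers from the retrieval sketch above; the class and the 0.95 threshold here are illustrative, not a prescription:

```python
class SemanticCache:
    # Serve a cached completion when a new input is close enough to one
    # we've already answered, instead of calling the LLM again.
    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn    # any embedding function
        self.threshold = threshold  # minimum similarity to count as a hit
        self.entries = []           # list of (embedding, completion) pairs

    def get(self, text: str):
        query = self.embed_fn(text)
        for vec, completion in self.entries:
            if cosine_similarity(query, vec) >= self.threshold:
                return completion   # cache hit: skip the LLM call
        return None

    def put(self, text: str, completion: str):
        self.entries.append((self.embed_fn(text), completion))
```

On a cache miss, call the LLM and `put` the new completion so the next similar request can hit the cache.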

Summary

Data, Experimentation, Lifecycle Management, and Continuous Improvement – these are the four pillars of building a production-grade AI application.

Invest in them and you’ll be amazed at how quickly you can cross the prototype-to-production chasm.

Need help getting started?

All this may sound daunting, but luckily, you don’t have to build it all yourself.

Vellum is a production-grade AI development platform that gives you the tools and best practices needed across all four pillars without needing to build complex internal tooling.

Reach out to me at akash@vellum.ai or book a demo if you’d like to learn more.

About the Author
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder of Vellum (YC W23), is enabling developers to easily start, develop, and evaluate LLM-powered apps. By talking to over 1,500 people at varying stages of maturity in using LLMs in production, he has acquired a unique understanding of the landscape and is actively distilling his learnings for the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey's Silicon Valley office.
