Announcing Vellum

We’re excited to publicly announce the start of our new adventure: Vellum


Hi everyone 👋

We’re excited to publicly announce the start of our new adventure: Vellum. We’re in Y Combinator’s current batch (W23) and our mission is to help companies get the best results from Large Language Models like GPT-3. Our product helps developers evaluate, manage and A/B test AI models/prompts to increase quality and reduce cost.

What problems are we trying to solve?

Since GPT-3 launched in 2020, we’ve seen companies like Jasper find compelling sales & marketing use cases. In the last two years the rate of improvement of these foundation models has been staggering, as clearly evidenced by OpenAI’s ChatGPT and models from Cohere and AI21.

With all these advances, companies around the world are looking to incorporate Large Language Models (LLMs) for generation and classification use cases, both in internal applications and in their core products. However, we’ve seen three recurring challenges when companies try to bring these models into production, each of which leads to slower iteration cycles and suboptimal LLM configurations:

  • Initial setup and deployment are difficult
  • Monitoring and other best practices require engineering teams to write lots of custom code
  • Ongoing model optimization and evaluation is time consuming and requires deep technical knowledge

Going from 0 -> 1

When coming up with initial prompts, we’ve seen firsthand the challenges developers face when choosing between model providers [1], foundation models [2], and model parameters [3]. Several browser tabs are needed to run experiments, and results are stored in long spreadsheets for side-by-side comparison. There’s no good way to collaborate with colleagues while iterating on prompts.

Choosing the right prompts often comes down to a time-boxed guessing game and you are never sure if a better outcome is possible – forget about spending the time to try fine-tuning!
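To make this concrete, here’s a minimal sketch of the kind of comparison script teams end up writing by hand. It uses the OpenAI Python SDK’s legacy Completion endpoint as it existed at the time; the prompts, model names, and parameter values are illustrative, and a similar loop would be needed for every other provider you want to compare against.

import csv
import itertools
import os

import openai  # pip install openai

openai.api_key = os.environ["OPENAI_API_KEY"]

prompts = [
    "Summarize this support ticket in one sentence:\n{ticket}",
    "You are a support agent. Briefly summarize the ticket below:\n{ticket}",
]
models = ["text-curie-001", "text-davinci-003"]  # illustrative foundation models
temperatures = [0.0, 0.7]                        # illustrative parameter sweep
ticket = "My January invoice was charged twice and support hasn't replied."

# Run every prompt/model/parameter combination and write the outputs to a CSV
# for side-by-side review -- essentially the spreadsheet most teams build by hand.
with open("prompt_comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "model", "temperature", "output"])
    for prompt, model, temperature in itertools.product(prompts, models, temperatures):
        response = openai.Completion.create(
            model=model,
            prompt=prompt.format(ticket=ticket),
            temperature=temperature,
            max_tokens=64,
        )
        writer.writerow([prompt, model, temperature, response["choices"][0]["text"].strip()])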

Managing Once in Production

Once the right prompt/model is deployed, a lot of internal custom code is needed to track model/prompt version history and keep an audit log of model inputs, outputs, and ground-truth results from end users. Setting up this infrastructure is important to measure performance, experiment with new prompts, and revert to older model versions if the changes are not ideal. These LLMs are so sensitive that changing a single word in your prompt can produce dramatically different results. Because of this, most developers are reluctant to iterate on and improve the model for fear that it’ll break existing behavior.

The time spent building and maintaining monitoring and testing infrastructure is non-trivial and could instead go towards building your core product.
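As a rough illustration, the record-keeping described above boils down to something like the sketch below: an append-only JSONL log of every model call. The field names such as prompt_version and ground_truth are hypothetical choices, not a prescribed schema.

import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class CompletionRecord:
    request_id: str
    prompt_version: str          # which revision of the prompt template produced this call
    model: str
    inputs: dict                 # variables substituted into the prompt template
    output: str                  # raw model completion returned to the user
    ground_truth: Optional[str]  # filled in later from end-user feedback
    latency_ms: float
    timestamp: float

def log_completion(record: CompletionRecord, path: str = "completions.jsonl") -> None:
    # Append-only audit log; in practice this would live in a database or warehouse.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

log_completion(CompletionRecord(
    request_id=str(uuid.uuid4()),
    prompt_version="support-summary@v3",
    model="text-davinci-003",
    inputs={"ticket": "My January invoice was charged twice..."},
    output="The customer was double-charged on their January invoice.",
    ground_truth=None,
    latency_ms=412.0,
    timestamp=time.time(),
))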

Optimizing to Get the Very Best

Once models have been running in production and the right tooling is set up, there's usually enough data available to fine-tune the models for better quality at a lower cost. However, getting the right fine-tuned model into production has its own challenges: getting training data into the right format, trial and error across different hyperparameter combinations, and retraining as new training data is collected.
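As one example of the data-formatting step, here’s a minimal sketch that converts logged examples into the JSONL prompt/completion pairs OpenAI’s fine-tuning API expected at the time. The separator and stop-sequence conventions shown are common practice rather than requirements, and the example contents are illustrative.

import json

# Illustrative conversion from logged (input, ground-truth) pairs into
# JSONL prompt/completion records for fine-tuning.
logged_examples = [
    {"ticket": "My January invoice was charged twice and support hasn't replied.",
     "summary": "The customer was double-charged on their January invoice."},
]

with open("finetune_data.jsonl", "w") as f:
    for ex in logged_examples:
        f.write(json.dumps({
            "prompt": f"Summarize this support ticket in one sentence:\n{ex['ticket']}\n\n###\n\n",
            "completion": " " + ex["summary"] + " END",  # leading space and explicit stop sequence
        }) + "\n")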

To add to the complexity, this problem is only expected to grow over time as new model providers and foundation models emerge, each with their own cost and quality tradeoffs. To keep up with the cutting edge, you have to constantly spend time evaluating new models as they’re released.

Why we chose this problem

We worked together at Dover (YC S19) for 2+ years, where we built production use cases of LLMs (both generation and classification). Noa and Sidd are MIT engineers who worked on DataRobot’s MLOps team and Quora’s ML Platform team, respectively.

We realized that the ops tooling we had built for traditional ML simply didn’t exist for LLMs. We’d build reasonable production use cases of AI, only to then hesitate to make changes and improve our setup due to a lack of observability. We ended up building custom internal tooling to solve this.

We’ve come to deeply feel the pains and requirements of using LLMs in production, user-facing applications. We’ve decided to productize our learnings and share them with other companies so more people can make use of Generative AI without having to overcome the steep learning curve we went through.

What's next for Vellum

We’re at the beginning of an exciting journey and will be releasing several products and sharing best practices on how to work with LLMs. Stay tuned for updates on our blog!


- Akash, Sidd & Noa

[1] Model provider examples: OpenAI, Cohere, AI21
[2] Foundation model examples: GPT-3’s Ada, Babbage, Curie and Davinci
[3] Parameter examples: Temperature, Top-P

ABOUT THE AUTHOR
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop and evaluate LLM-powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape, and is actively sharing his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey’s Silicon Valley Office.

Last updated: Feb 2, 2023