Redfin's Test-Driven Development Approach to Building an AI Virtual Assistant

Discover how Redfin used Vellum to develop and evaluate a production-ready AI assistant, now live in 14 markets.

On March 7, Redfin announced the beta launch of Ask Redfin, an AI-powered virtual assistant that provides quick answers to homebuyers' questions about properties for sale. With Ask Redfin, house hunters can easily obtain information about listings, such as upcoming open houses, monthly HOA fees, school districts, and more.

To build this conversational system, Redfin adopted a test-driven development approach to set a high bar for Ask Redfin’s ability to answer questions accurately and fairly.

With this objective in mind, they used Vellum to experiment with and evaluate different prompts and workflows across a wide array of test cases before deploying their virtual assistant into production.

For those interested in creating a production-ready AI chatbot using a test-driven development approach, keep reading to learn more.

Who is Redfin?

Redfin (www.redfin.com) is a technology-powered real estate company. They help people find a place to live with brokerage, rentals, lending, title insurance, and renovations services. They run the country's #1 real estate brokerage site. Their customers can save thousands in fees while working with a top agent.

Their home-buying customers see homes first with on-demand tours, and their lending and title services help them close quickly. Customers selling a home can have their renovations crew fix it up to sell for top dollar. Their rentals business empowers millions nationwide to find apartments and houses for rent. Since launching in 2006, they’ve saved customers more than $1.6 billion in commissions.

They serve more than 100 markets across the U.S. and Canada and employ over 4,000 people.

Why Did Redfin Choose Vellum?

Redfin’s team knew that linking to a model API and integrating unique data can create a solid proof of concept (POC), but developing a production-ready AI virtual assistant requires much more.

It involves understanding user intents, devising strategies for various responses, evaluating every step of the workflow, and selecting an optimal model and prompt combination for accurate and relevant responses.

Test-Driven AI Virtual Assistant Development

Redfin was looking for an AI development platform that would support a test-driven approach to building a reliable conversational system. They wanted to test their logic iteratively and achieve the highest precision possible across a variety of customer questions, all while being particularly thoughtful about fair housing.

Vellum's products fit Redfin's requirements well. By integrating with Vellum, Redfin’s product and engineering teams were able to collaborate far more effectively and quickly scale out the building and testing of their chatbot logic.

We sat down with Sebi Lozano, Redfin’s Senior Product Manager, to learn more. Here’s Redfin’s journey from a simple concept to a fully implemented AI virtual assistant that enhances the home-buying experience for users nationwide.

How does Redfin use Vellum today?

Collaborate on Prompts

Redfin used Vellum’s prompt engineering environment to pick the right prompt/model for a given task. They iteratively tested prompts to evaluate Ask Redfin’s ability to answer questions correctly.

Prompt engineering is a core part of any LLM application, and Vellum’s tooling made it much faster to create good prompts.

📹 Here’s a quick demo on how “Prompts” work.

Build Complex AI Virtual Assistant Logic

Given Redfin’s scale, they cared deeply about minimizing cost and latency without sacrificing quality. To accomplish this, they had to break down conversational flows into several nodes. They used “Vellum Workflows” to connect the prompts, classifiers, external APIs and the data manipulation steps into one multi-step AI workflow.

In the Workflow builder, they were able to connect all this logic by using customizable nodes that can handle various data, tools and LLM tasks.
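As a rough illustration of what a multi-step flow like this can look like, here is a minimal sketch in plain Python. It is not Vellum's API or Redfin's implementation: the intent names, the keyword-based `classify_intent`, the hard-coded `fetch_listing`, and the handler functions are all hypothetical stand-ins for a classifier prompt, an external API call, and per-intent prompts.

```python
from typing import Callable, Dict


def classify_intent(question: str) -> str:
    """Stand-in for an LLM classifier node (hypothetical keyword rules)."""
    q = question.lower()
    if "hoa" in q:
        return "hoa_fees"
    if "open house" in q:
        return "open_house"
    return "fallback"


def fetch_listing(listing_id: str) -> Dict[str, str]:
    """Stand-in for an external API node; returns hard-coded sample data."""
    return {"hoa_fee": "$250/month", "open_house": "Saturday 1-3pm"}


def answer_hoa(listing: Dict[str, str]) -> str:
    """Stand-in for a prompt node that answers HOA questions."""
    return f"The HOA fee is {listing['hoa_fee']}."


def answer_open_house(listing: Dict[str, str]) -> str:
    """Stand-in for a prompt node that answers open house questions."""
    return f"The next open house is {listing['open_house']}."


# Routing table: each intent maps to the node that handles it.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "hoa_fees": answer_hoa,
    "open_house": answer_open_house,
}


def run_workflow(question: str, listing_id: str) -> str:
    """Classify, then route; unknown intents skip the data fetch entirely."""
    intent = classify_intent(question)
    handler = HANDLERS.get(intent)
    if handler is None:
        return "Sorry, I can't answer that yet."
    listing = fetch_listing(listing_id)
    return handler(listing)


if __name__ == "__main__":
    print(run_workflow("What are the HOA fees?", "listing-123"))
    print(run_workflow("Is the roof new?", "listing-123"))
```

One possible benefit of splitting the flow this way is that each node can be tested and swapped on its own, and a cheap step such as the classifier can short-circuit more expensive ones.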

Their product team was able to independently test changes, make tweaks to prompts, and even try out entirely new chains and then collaborate with engineering to productionize the best of the best.

📹 Here’s a quick demo on how “Workflows” work.

Systematically Evaluate Prompts Pre-Production

Generative AI chatbot development, aka prompt engineering, is extremely iterative. When you make one change to a prompt, you want to make sure that it’s had the effect you expected and that it didn’t create a regression in another part of the system.

To navigate this complexity, Redfin used “Vellum Evaluations”, which allowed them to rigorously test each prompt/model combination. They used hundreds of test cases to evaluate how well the virtual assistant was answering questions.

This approach enabled them to evaluate all LLM outputs in their virtual assistant logic, across multiple intent and action combinations, to ensure they met their quality threshold.
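To make the idea concrete, here is a minimal sketch of a test-case harness that reports a pass rate per intent and flags intents below a quality threshold. It is an illustration under assumptions, not Vellum Evaluations or Redfin's test suite: `EvalCase`, the substring check, the sample cases, and `stub_assistant` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    intent: str        # hypothetical intent label, e.g. "hoa_fees"
    question: str
    must_contain: str  # simple check: substring expected in the answer


def pass_rate_by_intent(
    cases: List[EvalCase], assistant: Callable[[str], str]
) -> Dict[str, float]:
    """Run every case and return the fraction that passed, per intent."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.intent] += 1
        answer = assistant(case.question)
        if case.must_contain.lower() in answer.lower():
            passed[case.intent] += 1
    return {intent: passed[intent] / total[intent] for intent in total}


def failing_intents(rates: Dict[str, float], threshold: float) -> List[str]:
    """Intents whose pass rate falls below the quality threshold."""
    return sorted(i for i, rate in rates.items() if rate < threshold)


def stub_assistant(question: str) -> str:
    """Stand-in for the prompt/model combination under test."""
    if "HOA" in question:
        return "The monthly HOA fee is $250."
    return "I'm not sure."


CASES = [
    EvalCase("hoa_fees", "What is the HOA fee?", "$250"),
    EvalCase("hoa_fees", "How much is the monthly HOA?", "$250"),
    EvalCase("open_house", "When is the next open house?", "Saturday"),
]

if __name__ == "__main__":
    rates = pass_rate_by_intent(CASES, stub_assistant)
    print(rates)
    print(failing_intents(rates, threshold=0.9))
```

Rerunning the same cases after every prompt change is what surfaces regressions: an intent that used to pass and now appears in `failing_intents` points at the part of the system the change broke. A real suite would likely use richer checks than a substring match.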

💡 Learn more about “Vellum Evaluations” on this link.

Learn from Redfin's journey in this recent webinar, where we covered how Redfin evaluated their virtual assistant before they shipped it nationwide:

What impact has this partnership had on Redfin?

By using Vellum’s technology, Redfin was able to follow a test-driven approach to developing Ask Redfin, which gave them the confidence to launch the virtual assistant as a Beta in 14 markets across the U.S.

Preview of Ask Redfin (Beta)

Sebi, Senior Product Manager at Redfin, says that to enable test-driven development, it’s crucial to separate prototyping from coding to speed up the team's workflow.

Using Vellum for testing our initial ideas about prompt design and model configuration was a game-changer. It allowed us to work without always needing engineering resources, enabling a broader group of people to work on prompts and to work faster without needing to deploy code to test changes. Once we had satisfying results from the prototyping phase, we then handed over the process to our engineers to integrate it into our production system. Vellum’s software, and their knowledgeable team, saved us hundreds of hours.

- Sebi Lozano, Senior Product Manager at Redfin

Apart from this, Sebi shares that it was really easy for Redfin’s team to:

  • Evaluate which models deliver the best value for the lowest cost: The whole team could readily compare different models and prompts. By analyzing performance and price together, they can make better tradeoffs between models and project expenses for when real users begin using the chatbot.
  • Evaluate intent handlers at scale: Beyond evaluating their prompt combinations, they gained confidence in response quality through large-scale testing with the “Evaluations” product. They tested their prompts against known and tricky scenarios, as well as new variations generated synthetically with another LLM call.
  • Learn best practices: Redfin collaborated weekly with Vellum’s in-house AI experts to tackle prompt hallucinations, learn the latest prompting techniques, and unblock any other problems they faced during the process.
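The performance-versus-price tradeoff in the first point can be sketched as simple arithmetic. The example below is illustrative only: the model names, accuracies, per-token prices, token counts, and request volume are made-up placeholders, not Redfin's results or any vendor's pricing.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelResult:
    name: str
    accuracy: float           # fraction of eval cases passed
    usd_per_1k_input: float   # placeholder price per 1,000 input tokens
    usd_per_1k_output: float  # placeholder price per 1,000 output tokens


def cost_per_request(m: ModelResult, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request given average token counts."""
    return (
        input_tokens / 1000 * m.usd_per_1k_input
        + output_tokens / 1000 * m.usd_per_1k_output
    )


def monthly_cost(
    m: ModelResult, requests_per_month: int, input_tokens: int, output_tokens: int
) -> float:
    """Projected monthly spend at a given request volume."""
    return requests_per_month * cost_per_request(m, input_tokens, output_tokens)


def cheapest_meeting_bar(
    results: List[ModelResult],
    min_accuracy: float,
    input_tokens: int,
    output_tokens: int,
) -> Optional[ModelResult]:
    """Cheapest model whose accuracy meets the bar, or None if none does."""
    eligible = [m for m in results if m.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda m: cost_per_request(m, input_tokens, output_tokens))


# Hypothetical eval results and prices, for illustration only.
RESULTS = [
    ModelResult("model_a", accuracy=0.95, usd_per_1k_input=0.01, usd_per_1k_output=0.03),
    ModelResult("model_b", accuracy=0.88, usd_per_1k_input=0.001, usd_per_1k_output=0.002),
    ModelResult("model_c", accuracy=0.93, usd_per_1k_input=0.003, usd_per_1k_output=0.006),
]

if __name__ == "__main__":
    choice = cheapest_meeting_bar(RESULTS, min_accuracy=0.9, input_tokens=1000, output_tokens=200)
    if choice is not None:
        print(choice.name, round(monthly_cost(choice, 100_000, 1000, 200), 2))
```

Filtering on accuracy first and then minimizing cost is only one possible policy; a team might instead weight latency or accept a lower bar for low-risk intents.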

Our collaboration with Redfin demonstrates how a test-driven approach, supported by the right tools, can accelerate the development of an AI-powered virtual assistant.

Observations and Learnings

For those interested in building a production-ready AI virtual assistant, the journey of Ask Redfin serves as an insightful guide. It underscores the value of a methodical approach to AI development, where continuous testing and refinement play critical roles in achieving a successful outcome.

Prompt Engineering Tips

Sebi also shares some prompt engineering tricks that helped them in their development process:

  • Be extremely explicit about how you want the LLM to evaluate an answer. Thinking of an LLM as an “intern” resonated with us.
  • Repeating phrases helped the LLM perform as expected (e.g. “be extremely strict”, “remember…”).
  • Chain of Thought reasoning (asking the LLM to write out its reasoning) made our classifiers more accurate and made it easier to debug issues.
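The tips above can be combined in a single classifier prompt. The sketch below is a hypothetical example, not one of Redfin's prompts: the label set, the template wording, and the "Reasoning:" / "Label:" output format are assumptions chosen to show explicit instructions, a repeated phrase, and written-out reasoning that can be inspected when debugging.

```python
import re
from typing import Optional, Tuple

# Hypothetical label set for a classifier that gates whether to answer.
LABELS = ("answerable", "not_answerable")

PROMPT_TEMPLATE = """You are grading whether a question can be answered from the listing data.
Be extremely strict.

Question: {question}
Listing data: {listing_data}

First write your reasoning step by step after "Reasoning:".
Then, on the last line, write "Label:" followed by exactly one of: {labels}.
Remember: be extremely strict."""


def build_prompt(question: str, listing_data: str) -> str:
    """Fill the template; the result would be sent to whichever model is in use."""
    return PROMPT_TEMPLATE.format(
        question=question, listing_data=listing_data, labels=", ".join(LABELS)
    )


def parse_response(text: str) -> Tuple[str, Optional[str]]:
    """Split a model response into (reasoning, label).

    The label is None when it is missing or not one of LABELS, so the
    caller can treat malformed output as a failure instead of guessing.
    """
    reasoning = text.split("Label:")[0].replace("Reasoning:", "", 1).strip()
    match = re.search(r"^Label:\s*(\S+)\s*$", text, flags=re.MULTILINE)
    if match is None:
        return reasoning, None
    label = match.group(1).lower()
    return reasoning, label if label in LABELS else None


if __name__ == "__main__":
    sample = "Reasoning: The listing data includes an HOA fee.\nLabel: answerable"
    print(build_prompt("What is the HOA fee?", "hoa_fee: $250/month"))
    print(parse_response(sample))
```

Keeping the reasoning alongside the label means a misclassified test case can be debugged by reading why the model chose that label.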

You can read more about the Chain of Thought technique here, or find more prompt engineering tips in this guide.

Using Vellum

By using Vellum, Redfin was able to simulate various user interactions and test the effectiveness of different prompts across numerous scenarios.

You can find more details on how to evaluate your RAG system in our latest guide here.

Want to Try Out Vellum?

Preview of the Vellum Workflows product.

Vellum has enabled more than 100 companies to build complex AI chatbot logic, evaluate their infra and ship production-grade apps. If you’re looking to develop a reliable AI assistant, we’re here to help you.

Request a demo for our app here or reach out to us at support@vellum.ai if you have any questions.

We’re excited to see what you and your team build with Vellum next!

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

LAST UPDATED
Apr 9, 2024