
How Should I Manage Memory for my LLM Chatbot?

Tips to most effectively use memory for your LLM chatbot.


When building a chatbot using a Large Language Model (LLM), app developers often want to draw upon a user’s previous conversations to inform newly generated content.

In order to create a highly personalized experience for each user, it’s critical to have the ability to reference earlier messages and prior conversations.

For example, one of our customers is building a chatbot that acts as a medical provider/physician. Since the provider-patient relationship is long term in nature and tailored to each patient’s needs, effective memory management is a crucial part of their product. Managing memory incorrectly results in awkward, generic experiences, but if you get it right, you open up exciting new ways of building products.

LLMs are inherently stateless, meaning they don’t have a built-in mechanism to remember or store information from one interaction to the next. Each request to the model is processed independently, without any knowledge of previous requests or responses. This stateless nature is a fundamental characteristic of LLMs, which can pose challenges when developing applications that require context or memory.

If your users are engaged, they’ll eventually build up long conversation histories, and given limited context windows and cost/latency constraints, it’s vital to carefully consider how you’re going to give the model the right context to respond to each message.

LLM Memory Management Options

Using LLMs with Bigger Context Windows

Each response generated by an LLM is based on the context provided at runtime.
Context windows of popular models range from 4k tokens (or 3,000 words) to 128k tokens (or 96,000 words). The LLM providers that offer the biggest context windows are Anthropic and OpenAI.

As the developer of an LLM-powered chatbot, you need to determine how to best leverage the context you’re passing in to implement memory. Adding every conversation, unmodified, into the context window can quickly run into high cost, high latency, and context window limits.
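For concreteness, here is a minimal sketch of that naive baseline, assuming OpenAI’s Python SDK; the model name and system prompt are illustrative placeholders:

```python
# Naive baseline: replay the full chat history on every request.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4-turbo",   # illustrative; any chat model works here
        messages=history,      # the ENTIRE history, every time
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

Because every call re-sends all prior turns, tokens, cost, and latency grow with conversation length; the strategies below exist to cap that growth.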

Using Vector Databases to Store Your Data

When there are a lot of prior conversations that may need to be referenced in the current user interaction, you also have the option to store this prior information in a vector database like Pinecone or Weaviate.


When referencing information from outside the current conversation, correctly retrieving it from the vector database and adding it to the context window is another aspect you, as the developer, need to get right.

Note: if a given conversation gets too long, you could store it in a vector DB too and reference just the relevant material using metadata filtering.
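As a rough sketch of what storing and retrieving prior conversations could look like, the snippet below assumes Pinecone’s Python client and OpenAI embeddings; the index name, user ID, and metadata schema are illustrative assumptions:

```python
# Sketch: store chat messages as vectors, then retrieve by similarity,
# scoped to one user (or one conversation) via a metadata filter.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("chat-memory")  # assumed index

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def store_message(user_id: str, conversation_id: str, msg_id: str, text: str) -> None:
    index.upsert(vectors=[{
        "id": msg_id,
        "values": embed(text),
        "metadata": {"user_id": user_id, "conversation_id": conversation_id, "text": text},
    }])

def retrieve_relevant(user_id: str, query: str, top_k: int = 5) -> list[str]:
    # The metadata filter is what lets you reference just the relevant
    # material from a single user's (or conversation's) history.
    results = index.query(
        vector=embed(query),
        top_k=top_k,
        filter={"user_id": {"$eq": user_id}},
        include_metadata=True,
    )
    return [m.metadata["text"] for m in results.matches]
```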

Memory Management Strategies

Active Conversational Memory

While the easiest way to manage memory in a conversation is passing the full chat history in the prompt, it’s not the most efficient option due to cost and latency concerns.
As the conversation history grows, the model will take longer to process the input and generate a response, leading to increased latency. Additionally, the cost of running the model increases with the length of the text, making this approach potentially expensive for long conversations.

Summarization and buffering are two techniques that can help, and you can use them in isolation or together. When choosing between them, decide whether you need context from the whole conversation, only the most recent messages, or both.

Here are the options available to you:

  • Summarization — Recursively summarize the conversation each time it exceeds some threshold of messages: this retains context from the whole conversation, but you may lose some details;
  • Buffering — Pass in the last N messages: the chatbot has complete recent context but no memory of older messages;
  • Summarization + buffering — Summarize the conversation up to the last N messages and pass those N messages in verbatim (see the sketch after this list): this gives the best of both worlds at the cost of a larger request;
  • Specific context retrieval — Store long conversations in a vector DB and add only the most relevant pieces based on the last N messages: this saves tokens, but the model may entirely miss context adjacent to the relevant messages in prior interactions.
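Here is a minimal sketch of the third option, a rolling summary plus a recent-message buffer. The buffer size, model, and summarization prompt are assumptions to tune for your use case:

```python
# Sketch: keep the last N messages verbatim; fold older turns into a
# running summary whenever the buffer overflows.
from openai import OpenAI

client = OpenAI()
KEEP_LAST_N = 6  # assumed buffer size

def compress_history(messages: list[dict], summary: str) -> tuple[str, list[dict]]:
    """Return (updated_summary, recent_messages) to build the next prompt from."""
    if len(messages) <= KEEP_LAST_N:
        return summary, messages
    older, recent = messages[:-KEEP_LAST_N], messages[-KEEP_LAST_N:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative summarization model
        messages=[{"role": "user", "content": (
            "Update this running conversation summary with the new turns.\n"
            f"Current summary: {summary or '(empty)'}\n"
            f"New turns:\n{transcript}\n"
            "Return only the updated summary."
        )}],
    )
    return resp.choices[0].message.content, recent
```

At generation time, prepend the summary (e.g., as a system message) and pass only the recent buffer, which keeps per-turn token usage roughly constant.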

Retaining Past Conversational Memory

If you’re building a chatbot that needs to remember prior interactions with a user, old conversations can be stored in a vector database and referenced at runtime.

Long-Term Memory Management Strategies

Long-term memory management boils down to using two strategies in tandem:

  1. Relevance: Identifying the parts of the past conversations that are likely to be useful in generating the next message;
  2. Compression: Reducing the number of tokens needed to represent the potentially relevant messages you’d like to pass in.

Relevance

The trick to pulling in the most relevant messages is defining what "relevant" means for your use case. Some common signals that a message is likely to be relevant (a scoring sketch combining the first two follows the list):

  • Recency of the message - the last few messages are more likely to be relevant than a message from last month
  • Similarity - messages with similar embeddings or using the same rare keywords as the last few messages are more likely to be relevant
  • Session context - the user’s recent actions in your product likely provide information on which messages may be more relevant
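As a sketch, the first two signals can be blended into one score. The exponential half-life, the weights, and the weighted sum itself are illustrative choices, not a prescribed formula:

```python
# Sketch: score candidate messages by recency decay + embedding similarity.
import math
import time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def relevance(msg_embedding: list[float], msg_timestamp: float,
              query_embedding: list[float],
              half_life_days: float = 7.0, w_recency: float = 0.3) -> float:
    age_days = (time.time() - msg_timestamp) / 86400
    recency = 0.5 ** (age_days / half_life_days)  # halves every `half_life_days`
    similarity = cosine(msg_embedding, query_embedding)
    return w_recency * recency + (1 - w_recency) * similarity

# Rank all candidates by relevance() and keep the top K; session-context
# signals could be added as a third weighted term.
```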

Compression

Compression, also known as summarization, involves sending prior chat history to a prompt with a large context window that summarizes the conversation or extracts critical details and keywords, then passing that output into another prompt that operates on the compressed representation of the chat history. You can use various prompt engineering techniques to compress prior conversations (e.g., a simple summary, extracting major themes, building a timeline), and the right choice depends on the user experience you’re trying to enable.

Keep in mind that you can mix relevance with compression, compressing only a subset of past chat messages.

Putting it all together for your application

A mix-and-match strategy between memory within the current conversation and memory from prior conversations seems most promising: use a prompt that takes two input variables, one holding memory from the current conversation and the other a compressed representation of all relevant prior interactions.
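A minimal sketch of that two-variable prompt; the template wording and variable names are illustrative (in Vellum these would be prompt input variables, here a plain Python format string):

```python
PROMPT_TEMPLATE = """You are a long-term assistant for this user.

Relevant prior interactions (compressed):
{prior_memory}

Current conversation:
{current_conversation}

Respond to the user's latest message."""

def build_prompt(prior_memory: str, current_conversation: str) -> str:
    return PROMPT_TEMPLATE.format(
        prior_memory=prior_memory,                  # retrieval + compression output
        current_conversation=current_conversation,  # buffered/summarized active chat
    )
```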

As you can see, the approach to managing memory for your LLM chatbot depends a lot on the user experience you’re trying to create. If you’d like tailored advice on your use case and want to build these approaches in our application without writing much custom code, request a demo here. We’re excited to see what you end up building with LLMs!

FAQs

What is the role of Pinecone?

Pinecone is a vector database that enables fast storage and information retrieval for your AI apps.

Why is adding memory to an LLM complex?

The process demands rigorous testing and optimization: you have to balance relevance, token cost, and latency on every request. Storing a lot of new data can also cause some of that information to overlap or "interfere" with other pieces of information the model already has.

ABOUT THE AUTHOR
Akash Sharma
Co-founder & CEO

Akash Sharma, CEO and co-founder at Vellum (YC W23), is enabling developers to easily start, develop, and evaluate LLM-powered apps. By talking to over 1,500 people at varying maturities of using LLMs in production, he has acquired a unique understanding of the landscape and is actively sharing his learnings with the broader LLM community. Before starting Vellum, Akash completed his undergrad at the University of California, Berkeley, then spent 5 years at McKinsey’s Silicon Valley office.

Last updated: Feb 14, 2024