
First impressions with the Assistants API

Assistants API: Easy assistant setup with memory management - but what's under the hood?


The Assistants API is user-friendly: it simplifies the RAG pipeline using best practices drawn from ChatGPT, and it handles memory management automatically.

If you want to spin up a chatbot, this might be your fastest and easiest way to do so.

OpenAI has done an excellent job in streamlining the process, making it accessible for developers new to AI to build their own assistants in just a few hours.

But what is actually happening under the hood? How much of the process can you control?

And what should you know before using the API?

We try to answer all of these questions and provide more insights below.

Knowledge Retrieval

The built-in Retrieval tool augments the LLM with knowledge from outside the model. Once a file is uploaded and passed to the Assistant, OpenAI automates the whole RAG process that developers previously had to custom-build.

They automatically chunk your documents, index and store the embeddings, and implement vector search to retrieve relevant content to answer user queries.
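Getting this running takes only a few lines. Here's a minimal sketch using the openai Python SDK (v1.x); the file name, assistant name, and instructions are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a knowledge file; OpenAI handles chunking, embedding, and indexing.
file = client.files.create(file=open("knowledge.pdf", "rb"), purpose="assistants")

# Attach the file and enable the built-in Retrieval tool.
assistant = client.beta.assistants.create(
    name="Docs assistant",  # hypothetical name
    instructions="Answer questions using the attached documentation.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)
```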

This is useful because it saves a lot of time: you rely on OpenAI to make those decisions for you.

However, it can also be limiting because you’ll have little control over the strategies and models used for the retrieval process, especially if they don’t work well for your specific needs and budget.

We explore the underlying models and algorithms, along with the associated costs, in the following section.

Embedding models / Chunking / Re-ranking

When it comes to text embedding models, OpenAI uses their best one, text-embedding-ada-002. It can perform well on specific tasks, and it's definitely worth testing out.

However, it's not the best one available. It currently ranks around 20th place on the MTEB benchmark.

Other models, like the open-source Instructor XL, are SOTA on 70 tasks. You can't use them with the Assistants API, though; you'll need a custom setup to test them out.
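For context, a "custom setup" mostly means calling an embeddings endpoint yourself rather than letting the Assistants API decide. Here's a minimal sketch with ada-002; any other embedding model can be swapped in at this step:

```python
from openai import OpenAI

client = OpenAI()

chunks = ["First document chunk...", "Second document chunk..."]  # hypothetical chunks

# Embed the chunks yourself; with a custom setup you control the model choice.
resp = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # 2 vectors, 1536 dimensions each
```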

As for the chunking and re-ranking algorithms, there is no publicly available information on how either is currently done under the hood.

Costs

Retrieval is priced at $0.20/GB per assistant per day, and you can only upload 20 files, which can be limiting for some use cases. If not managed well, this can get very expensive.

However, if your information is provided in text or CSV format, you can compress a large amount of content—potentially tens of thousands of pages—into a file smaller than 10 megabytes. Therefore, the limit of 20 files can appear arbitrary.
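To put the pricing in perspective, here's a quick back-of-the-envelope estimate (the storage size and assistant count are assumptions):

```python
GB_STORED = 1.0          # assumption: total size of uploaded files
NUM_ASSISTANTS = 3       # assumption: assistants sharing those files
PRICE_PER_GB_DAY = 0.20  # $/GB per assistant per day, per OpenAI's pricing

daily = GB_STORED * NUM_ASSISTANTS * PRICE_PER_GB_DAY
print(f"Retrieval storage: ${daily:.2f}/day, ~${daily * 30:.2f}/month")
# Retrieval storage: $0.60/day, ~$18.00/month
```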

Memory Management

The Threads functionality turns the "stateless" Chat Completions model (no memory) into a "stateful" one. Previous messages are stored on a thread, eliminating the need for developers to implement custom memory management techniques.
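Here's what the stateful flow looks like in a minimal sketch, assuming the assistant from the earlier example: create a thread once, append messages to it, and start a run for each turn.

```python
from openai import OpenAI
import time

client = OpenAI()

thread = client.beta.threads.create()  # the thread stores the conversation state

client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What does our refund policy say?"
)

# A run asks the assistant to process the thread; poll until it finishes.
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,  # the assistant created in the earlier sketch
)
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # newest message comes first
```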

This was the puzzle piece missing from the Chat Completions API. Now you can capture recent conversations and provide better answers.

Below, we look into how Threads work, memory techniques, and tokens.

The entire conversation is re-sent to the server with each user interaction

The current memory setup of the Assistants API re-sends the entire thread to the model each time a new message is added.

This can lead to exploding costs and high latency. Many developers are already flagging this issue on the OpenAI forum; some report being charged over $3 for summarizing a 90-page (~55K-token) PDF file.

Threads also don't have a size limit; you can pass as many messages as you want to a Thread. Imagine how expensive this can get with a 128K context window.

To manage this, you can programmatically trim or cap the context size (depending on your needs) so that costs don't explode.
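There's no built-in truncation, so one workaround is to periodically re-create the thread with only the most recent user turns. A hedged sketch (the cap and helper are ours, not part of the API; thread seeding only accepts user-role messages):

```python
from openai import OpenAI

client = OpenAI()

MAX_MESSAGES = 20  # assumption: tune to your latency/cost budget

def trimmed_thread(old_thread_id: str) -> str:
    """Hypothetical helper: re-create a thread with only recent user turns."""
    # messages.list returns newest-first; cap how many we carry over
    recent = client.beta.threads.messages.list(
        thread_id=old_thread_id, limit=MAX_MESSAGES
    )
    seed = [
        {"role": "user", "content": m.content[0].text.value}
        for m in reversed(recent.data)  # restore chronological order
        if m.role == "user"             # seeding only accepts user messages
    ]
    return client.beta.threads.create(messages=seed).id
```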

Different chatbots require custom memory techniques

When you're building a chatbot, you need to be able to use a memory technique that's tailored to your use case.

Do you expect your user interactions to be short, and you want to store everything in a vector db? Or maybe you want to buffer the most recent messages and save on costs?

There are multiple memory techniques, which we outline here, and we don't think that a single technique like Threads will fit every use case.
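For contrast, here's roughly what a buffer-window memory looks like when you manage state yourself with the Chat Completions API (the window size is an arbitrary assumption):

```python
from openai import OpenAI

client = OpenAI()

history: list[dict] = []
WINDOW = 6  # assumption: keep only the last 6 turns to bound cost

def chat(user_msg: str) -> str:
    """Buffer-window memory: only the most recent turns are re-sent."""
    history.append({"role": "user", "content": user_msg})
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages += history[-WINDOW:]
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview", messages=messages
    )
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```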

No token breakdown for the Assistants API

Another thing that's currently unclear is how many tokens the assistant uses, and for which actions, across retrieval, tools, and generation. There is no documentation on this, and we'll update the post as that info becomes available.

Function Calling

Function calling has been improved in terms of accuracy, so GPT-4 Turbo is now more likely to return the right function parameters.

More importantly, you can now pass one message requesting multiple actions, for example: "find my location and give me X nearby restaurants". This update allows multiple functions to be called in parallel, avoiding additional round-trips to the API.
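A sketch of what this looks like with the Chat Completions API; both function definitions below are hypothetical:

```python
from openai import OpenAI
import json

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {  # hypothetical function
            "name": "get_location",
            "description": "Resolve the user's current location.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
    {
        "type": "function",
        "function": {  # hypothetical function
            "name": "find_restaurants",
            "description": "Find restaurants near a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "count": {"type": "integer"},
                },
                "required": ["location"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "user", "content": "Find my location and give me 5 nearby restaurants"}
    ],
    tools=tools,
)

# With parallel function calling, a single response can carry several tool calls.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```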

However, you need to set up custom triggers for when a function call should be executed.

Understanding user intents is critical for assistants

To accurately capture the user's intended purpose, it is crucial to have better control over whether a function should be executed.

Should we fire a function call to an external API, or should the assistant just answer the message from our knowledge base?

This is currently handled automatically by the Assistants API, but for specific use cases you need to be able to control it with certain triggers or rules.

These can be keyword detection, intent handlers, user preferences etc. Each method has its own advantages and complexities, and the best approach depends on the specific requirements and capabilities of your assistant.
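As an illustration, the crudest possible trigger is a keyword gate; a production assistant would more likely use an intent classifier. Everything in this sketch is hypothetical:

```python
ACTION_KEYWORDS = ("book", "schedule", "order", "near me")  # hypothetical trigger list

def should_call_function(user_msg: str) -> bool:
    """Naive keyword gate deciding whether the tool path should fire."""
    msg = user_msg.lower()
    return any(keyword in msg for keyword in ACTION_KEYWORDS)

def run_with_tools(user_msg: str) -> str:
    return f"[function-calling path] {user_msg}"  # placeholder for a tools-enabled call

def answer_from_knowledge_base(user_msg: str) -> str:
    return f"[retrieval path] {user_msg}"  # placeholder for a plain RAG answer

def route(user_msg: str) -> str:
    if should_call_function(user_msg):
        return run_with_tools(user_msg)
    return answer_from_knowledge_base(user_msg)

print(route("Order a pizza near me"))       # takes the function-calling path
print(route("What's your refund policy?"))  # answered from the knowledge base
```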

Function calling and tokens

Under the hood, functions are injected into the system message in a syntax the model has been trained on. This means functions count against the model's context limit and are billed as input tokens.

To minimize the number of tokens, you can instruct the LLM to be concise, possibly adding "Only return the function call, and do not output anything else." You may need to experiment with your prompts to see which one gets you the best answer.
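Because the injected syntax is undocumented, you can only approximate the overhead, for example by tokenizing the raw schema with tiktoken. This is a rough estimate, not OpenAI's exact accounting:

```python
import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

tools = [{  # hypothetical function definition
    "type": "function",
    "function": {
        "name": "find_restaurants",
        "description": "Find restaurants near a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

# Approximation: tokenize the JSON schema itself; OpenAI's internal syntax differs.
print(f"~{len(enc.encode(json.dumps(tools)))} tokens consumed by this definition")
```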

Code Interpreter

You can also enable the Code Interpreter tool right in the playground UI; it allows your Assistant to solve challenging code problems, analyze data, create charts, edit files, perform math, and more.
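The same toggle is available through the API. A minimal sketch (the name and instructions are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Enable Code Interpreter the same way Retrieval is enabled: via the tools list.
assistant = client.beta.assistants.create(
    name="Data analyst",  # hypothetical assistant
    instructions="Analyze uploaded CSVs and produce charts.",
    model="gpt-4-1106-preview",
    tools=[{"type": "code_interpreter"}],
)
```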

Let's look at which languages are supported, and how the cost is calculated for this tool.

Code Interpreter: Python Language Only

The Code Interpreter can only write and run Python code, and it supports a finite list of libraries. You can't add external libraries, and since it doesn't have internet access, some of the included packages can't be used to their full extent.

If you need support for other languages, including Python, C++, Java, PHP, TypeScript (JavaScript), C#, Bash, and more, a great alternative is Code Llama. It's also open source and free for commercial use.

Code Interpreter cost in the Assistants API

Currently, it costs $0.03 per session, and the tool is free until 11/17/2023.

But what counts as a session?

Here’s the official explanation from OpenAI: If your assistant calls Code Interpreter simultaneously in two different threads, this would create two Code Interpreter sessions (2 * $0.03). Each session is active by default for one hour, which means that you would only pay this fee once if your user keeps giving instructions to Code Interpreter in the same thread for up to one hour.
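So the variable to watch is how many distinct threads invoke the tool per hour. A quick back-of-the-envelope (the usage numbers are assumptions):

```python
THREADS_PER_HOUR = 10    # assumption: threads invoking Code Interpreter per hour
ACTIVE_HOURS = 8         # assumption: active hours per day
COST_PER_SESSION = 0.03  # $/session, per OpenAI's pricing

daily = THREADS_PER_HOUR * ACTIVE_HOURS * COST_PER_SESSION
print(f"Code Interpreter: ${daily:.2f}/day, ~${daily * 30:.2f}/month")
# Code Interpreter: $2.40/day, ~$72.00/month
```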

If you run multiple threads per hour and this starts to increase your spending, you can always revert to using open-source code generation LLMs like Code Llama, which is SOTA among publicly available LLMs on coding tasks.

Final thoughts

There are a lot of complexities that go into building a chatbot.

For example, collaboration might be key for specific teams, where you’d need to bring on a non-technical person to prototype with prompts before they’re pushed into production.

Then, for some use cases we've seen people do much better in terms of cost, latency, and quality with other commercial models like PaLM or Claude, or open-source ones like Llama. With the Assistants API, you're limited to using only OpenAI's models.

Finally, building a chatbot for production often requires more control over the setup; the Assistants API is most useful for prototyping and spinning up MVP chatbots.

If you want more control over the retrieval processes, the ability to prototype with various prompts/models, collaborate with your team, and have better oversight over model completions, tokens, cost, and latency - we can help you.

Vellum has the tooling layer to experiment with prompts and models, evaluate their quality, and make changes with confidence once in production.

You can take a look at our use-cases, or book a call to talk with someone from our team.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

LAST UPDATED
Nov 16, 2023