
How to Manage OpenAI Rate Limits as You Scale Your App

Learn about the current rate limits and strategies like exponential backoff and caching to help you avoid them.


How Rate Limits are Enforced

Let’s begin by discussing how rate limits are enforced, and what tiers exist for different providers.

Rate limit measurements

OpenAI and Azure OpenAI enforce rate limits in slightly different ways, but they both use some combination of the following factors to measure request volume:

  • Requests per minute (RPM)
  • Requests per day (RPD)
  • Tokens per minute (TPM)
  • Tokens per day (TPD)

Reaching any one of these thresholds is enough for your requests to be rate limited, even if you are well under the others.

For example, imagine that your requests per minute is 20 and your tokens per minute is 1000. Now, consider the following scenarios:

  • You send 20 requests, each requesting 10 tokens. In this case, you would hit your RPM limit first and your next request would be rate limited.
  • You send 1 request, requesting 1000 tokens. In this case, you would hit your TPM limit first and any subsequent requests would be rate limited.
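The two scenarios above can be replayed with a small simulation. This is only a sketch of the accounting within a single one-minute window; the function name `first_limit_hit` is made up for illustration, and real providers may count tokens and windows differently.

```python
def first_limit_hit(requests, rpm_limit, tpm_limit):
    """Return (index, reason) for the first request that would be rejected
    within one minute, or None if every request fits.

    `requests` is a list of token counts, one entry per request.
    """
    used_requests = 0
    used_tokens = 0
    for i, tokens in enumerate(requests):
        if used_requests + 1 > rpm_limit:
            return i, "RPM"
        if used_tokens + tokens > tpm_limit:
            return i, "TPM"
        used_requests += 1
        used_tokens += tokens
    return None


# Scenario 1: 20 small requests use up the RPM budget; the 21st is rejected.
print(first_limit_hit([10] * 21, rpm_limit=20, tpm_limit=1000))   # (20, 'RPM')
# Scenario 2: one 1000-token request uses up the TPM budget; the next is rejected.
print(first_limit_hit([1000, 10], rpm_limit=20, tpm_limit=1000))  # (1, 'TPM')
```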

Notably, rate limits can be quantized, where they can be enforced over shorter periods of time via proportional metrics. For example, an RPM of 600 may be enforced in per-second iterations, where no more than 10 requests per second are allowed. This means that short activity bursts may get you rate limited, even if you’re technically operating under the RPM limit!
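To make the burst problem concrete, here is a minimal sketch that checks a list of request timestamps against an RPM limit enforced per second. It assumes fixed one-second windows, which is a simplification; providers do not publish the exact windowing they use, and the helper name is hypothetical.

```python
from collections import Counter


def exceeds_quantized_limit(timestamps, rpm_limit, window_seconds=1.0):
    """Return True if any fixed window holds more requests than its share of the RPM limit."""
    allowed_per_window = rpm_limit * window_seconds / 60
    counts = Counter(int(t // window_seconds) for t in timestamps)
    return any(n > allowed_per_window for n in counts.values())


# 15 requests within the same second: far below 600 per minute, but above 10 per second.
burst = [i * 0.01 for i in range(15)]
print(exceeds_quantized_limit(burst, rpm_limit=600))  # True

# The same 15 requests spaced 0.1 s apart never exceed 10 per second.
paced = [i * 0.1 for i in range(15)]
print(exceeds_quantized_limit(paced, rpm_limit=600))  # False
```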

What’s New / Updated for 2025

OpenAI’s latest rate limit increases (for GPT-5 and GPT-5-mini)

  • As of September 12, 2025, OpenAI raised the TPM limit for GPT-5 at Tier 1 from ~30,000 to 500,000 (source: OpenAI Community).
  • GPT-5-mini also saw similarly large increases at its higher tiers (source: OpenAI Community).

Azure OpenAI quotas & defaults

  • Azure OpenAI defines TPM and RPM quotas per region, per subscription, and per model or deployment (source: Microsoft Learn).
  • Example: GPT-4.1 has a default quota of 1,000,000 TPM in many regions for standard/default subscriptions (source: Microsoft Learn).
  • When you allocate TPM to a deployment, an RPM limit is set proportionally (i.e., increasing TPM raises RPM) (source: Microsoft Learn).
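The proportional relationship between TPM and RPM can be sketched as a one-line calculation. The default ratio of 6 RPM per 1,000 TPM used here is an assumption based on what Azure has documented for some models; the ratio can differ by model, so treat it as a parameter and check your own quota page.

```python
def rpm_for_tpm(tpm, rpm_per_1000_tpm=6):
    """Estimate the RPM limit that accompanies a TPM allocation.

    rpm_per_1000_tpm is an assumed ratio, not a value taken from this article.
    """
    return tpm * rpm_per_1000_tpm // 1000


print(rpm_for_tpm(120_000))  # 720
print(rpm_for_tpm(240_000))  # 1440: doubling TPM doubles RPM
```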

How to Avoid Rate Limit Errors (Updated Advice for 2025)

  • Set max_tokens close to what you actually need, not higher, because TPM counts the larger of your input tokens and the max_tokens you set (source: Microsoft Learn).
  • Use retries with exponential backoff when you get 429 errors: wait, then try again, increasing the wait time up to a limit.
  • Spread out requests rather than sending many in a quick burst, even if you are under your average per-minute limits.
  • Monitor your usage: track how much TPM and RPM you’re using per region and per model. If you’re close to your limit, you may need to request more quota (source: Microsoft Learn).

Add retries with exponential backoff

A common way to avoid rate limit errors is to add automatic retries with random exponential backoff. This method involves waiting for a short, random period (aka a “backoff”) after encountering a rate limit error before retrying the request. If the request fails again, the wait time is increased exponentially, and the process is repeated until the request either succeeds or a maximum number of retries is reached.

Here’s an example of how to implement retries with exponential backoff using the popular backoff Python module (alternatively, you can use Tenacity or Backoff-utils):

import openai
import backoff

client = openai.OpenAI()

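# Retry with exponential backoff (plus the library's default random jitter)
# whenever we hit a RateLimitError. Give up after 30 seconds in total or
# 5 attempts, whichever comes first.
@backoff.on_exception(backoff.expo,
                      openai.RateLimitError,
                      max_time=30,
                      max_tries=5)
def send_prompt_with_backoff(**kwargs):
    return client.completions.create(**kwargs)

# The legacy Completions endpoint needs a completions-capable model such as
# gpt-3.5-turbo-instruct; chat models like gpt-3.5-turbo use client.chat.completions.
send_prompt_with_backoff(model="gpt-3.5-turbo-instruct", prompt="How does photosynthesis work")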

This strategy offers several advantages:

  1. Automatic recovery: Automatic retries help recover from rate limit errors without crashes or data loss. Users may have to wait longer, but intermediate errors are hidden.
  2. Efficient retries: Exponential backoff allows quick initial retries and longer delays for subsequent retries, maximizing chance of success while minimizing user wait time.
  3. Randomized delays: Random delays prevent simultaneous retries, avoiding repeated rate limit hits.

Keep in mind that unsuccessful requests still count towards your rate limits for both OpenAI and Azure OpenAI. Evaluate your retry strategy carefully to avoid exceeding rate limits with unsuccessful requests.
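If you would rather not add a dependency, the same idea can be written by hand. The sketch below uses "full jitter" (a random wait between zero and an exponentially growing ceiling). `RateLimited` is a hypothetical stand-in for your provider's 429 error, and the `sleep` and `rng` parameters exist only so the function can be exercised without real waiting.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 / rate limit error."""


def retry_with_backoff(func, max_tries=5, base=1.0, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call func(); on RateLimited, wait a random time below
    min(cap, base * 2**attempt) seconds and try again."""
    for attempt in range(max_tries):
        try:
            return func()
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: anywhere in [0, ceiling)


# Demo with a function that is rate limited twice before succeeding.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ok"

print(retry_with_backoff(flaky_call, sleep=lambda seconds: None))  # ok
```

Capping the ceiling keeps a long outage from producing multi-minute waits, and the hard limit on attempts keeps failed retries from eating into your quota.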

Adding Exponential Backoff Logic in Vellum

In Vellum, this kind of retry logic can be built directly into a Workflow. In the example Workflow, if the prompt node encounters an error, the workflow waits for 5 seconds and retries up to 2 times.

Optimize your prompts and token usage

While it’s straightforward to measure your RPM, it can be trickier to measure your TPM. By its simplest definition, a “token” is a segment of a word. When you send a request to an OpenAI API, the input is sliced up into tokens and the response is generated as tokens. Therefore, when thinking about TPM, you need to consider both the input tokens you send and the output tokens the model generates.

Not sure how to measure your inputs and outputs in terms of tokens?

Learn about OpenAI's Tiktoken Library and how to calculate your tokens programmatically here. 

OpenAI provides a parameter max_tokens that enables you to limit the number of tokens generated in the response. When evaluating your TPM rate limit, OpenAI and Azure OpenAI use the maximum of the input tokens and your max_tokens parameter to determine how many tokens will count towards your TPM. Therefore, if you set your max_tokens too high, you will end up using up more of your TPM per request than necessary. Always set this parameter as close as possible to your expected response size.
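As a rough illustration of why an oversized max_tokens hurts, the sketch below applies the counting rule described above (the larger of input tokens and max_tokens) to see how many requests fit in a TPM budget. The function names and the 30,000 TPM figure are illustrative assumptions, and a provider's real accounting may differ in detail.

```python
def counted_tokens(input_tokens, max_tokens):
    """Tokens charged against TPM for one request, per the rule described above."""
    return max(input_tokens, max_tokens)


def requests_per_minute_within_tpm(tpm_limit, input_tokens, max_tokens):
    """How many identical requests fit into one minute of TPM budget."""
    return tpm_limit // counted_tokens(input_tokens, max_tokens)


# A 200-token prompt with max_tokens=2000 is counted as 2000 tokens...
print(requests_per_minute_within_tpm(30_000, 200, 2000))  # 15
# ...while max_tokens=300 lets far more requests fit into the same budget.
print(requests_per_minute_within_tpm(30_000, 200, 300))   # 100
```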

Prompt chaining

Additionally, instead of using very long prompts for a task, consider using prompt chaining.

Prompt chaining involves dividing a complex task into more manageable subtasks using shorter, more specific prompts that connect together. Since your token limit includes both your input and output tokens, using shorter prompts is a great way to manage complex tasks without exceeding your token limit.

We wrote more on this strategy in this article.
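A minimal sketch of a chain might look like the following. The three prompts and the `fake_model` stub are invented for illustration; in practice `call_model` would wrap whatever API call your application already makes.

```python
def run_chain(document, call_model):
    """Run three short prompts in sequence, feeding each output into the next."""
    facts = call_model(f"List the key facts in this text:\n{document}")
    outline = call_model(f"Turn these facts into a three-point outline:\n{facts}")
    summary = call_model(f"Write a two-sentence summary from this outline:\n{outline}")
    return summary


def fake_model(prompt):
    # Stand-in for a real API call: echoes the instruction line so the chain can be traced.
    return f"<{prompt.splitlines()[0]}>"


print(run_chain("Plants convert light into chemical energy.", fake_model))
# <Write a two-sentence summary from this outline:>
```

Each step sends only the previous step's output rather than the full original context, which is what keeps the individual requests small.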

Use caching to avoid duplicate requests

Caching stores copies of data in a temporary location, known as a cache, to speed up requests for recent or frequently accessed data. It can also store API responses so future requests for the same information use the cache instead of the API.
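As a starting point, an exact-match cache can be as simple as a dictionary keyed by a hash of the request. The class below is a sketch with hypothetical names; it keeps everything in memory and never expires entries, which a production cache would need to handle.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by the exact model, prompt, and parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, **params):
        payload = json.dumps(
            {"model": model, "prompt": prompt, "params": params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, call_model, model, prompt, **params):
        """Return the cached response, or call the API once and remember the result."""
        key = self._key(model, prompt, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = call_model(model=model, prompt=prompt, **params)
        self._store[key] = result
        return result


def fake_api(model, prompt, **params):
    # Stand-in for a real API call.
    return f"response to: {prompt}"


cache = ResponseCache()
cache.get_or_call(fake_api, "some-model", "How does photosynthesis work")
cache.get_or_call(fake_api, "some-model", "How does photosynthesis work")
print(cache.hits, cache.misses)  # 1 1
```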

Caching in LLM applications can be tricky since requests for the same information may look different. For example, How hot is it in London and What is the temperature in London request the same information but would not match in a simple cache.

Semantic caching solves this by using text similarity measures to determine if requests are asking for the same information. This allows different prompts to be pulled from the cache, reducing API requests. Consider semantic caching when your application frequently receives similar requests; you can use libraries like Zilliz’s GPTCache to easily implement it.
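The lookup logic of a semantic cache can be sketched in a few lines. This is not how GPTCache is implemented; it is a toy version in which `toy_embed` (a bag of words with a tiny hand-written synonym table) stands in for a real embedding model, and the 0.9 threshold is an arbitrary assumption.

```python
import math
import re
from collections import Counter

# Toy stand-in for a real embedding model.
SYNONYMS = {"hot": "temperature", "warm": "temperature"}
STOPWORDS = {"how", "what", "is", "it", "the", "in", "a"}


def toy_embed(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(SYNONYMS.get(w, w) for w in words if w not in STOPWORDS)


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        """Return the response of the most similar stored prompt, or None if none is close enough."""
        query = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.store("How hot is it in London", "cached answer about London")
print(cache.lookup("What is the temperature in London"))  # cached answer about London
print(cache.lookup("What is the temperature in Paris"))   # None
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the cache rarely hits.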

Model providers are also recognizing the need for native caching features in complex workflows. Google’s context caching for Gemini models lets users cache input tokens and reuse them for multiple requests. This is particularly useful for repeated queries on large documents, chatbots with extensive instructions, and recurring analysis of code repositories. OpenAI and Azure OpenAI now offer prompt caching for repeated prompt prefixes as well; check each provider’s documentation for how cached tokens are billed and whether they count toward your rate limits.

How to increase your rate limits

If you’re still finding it hard to stay within your rate limits, your best option may be to contact OpenAI or Microsoft to increase your rate limits. Here’s how you can do that:

  • OpenAI: You can review your usage tier by visiting the limits section of your account’s settings. As your usage and spend on the OpenAI API goes up, OpenAI will automatically elevate your account to the next usage tier, which will cause your rate limits to go up.
  • Azure OpenAI: You can submit quota increase requests from the Quotas page of Azure OpenAI Studio. Due to high demand, Microsoft is prioritizing requests for customers who fully utilize their current quota allocation, so it may be worth waiting until you are consistently using your full allocation before submitting a request.

Strategies for Maximizing Throughput

If you care more about throughput — i.e. the number of requests and/or the amount of data that can be processed — than latency, also consider implementing these strategies:

Add a delay between requests

Even with exponential backoff, you may still hit the rate limit during the first few retries. This can result in a significant portion of your request budget being used on failed retries, reducing your overall processing throughput.

To address this, add a delay between your requests. A useful heuristic is to introduce a delay equal to the reciprocal of your request rate limit, i.e. 60 / RPM seconds. For example, if your rate limit is 60 requests per minute, add a delay of 1 second between requests. This helps maximize the number of requests you can process while staying under your limit.

import openai
import time

client = openai.OpenAI()

REQUESTS_PER_MINUTE = 30
# Seconds to wait between requests: 60 / RPM (2 seconds at 30 RPM).
DEFAULT_DELAY = 60 / REQUESTS_PER_MINUTE

def send_request_with_delay(delay_seconds=DEFAULT_DELAY, **kwargs):
    time.sleep(delay_seconds)  # Delay in seconds
    # Call the Completions API and return the result
    return client.completions.create(**kwargs)

# gpt-3.5-turbo is a chat model; the Completions endpoint needs an instruct model.
send_request_with_delay(
    model="gpt-3.5-turbo-instruct",
    prompt="How does photosynthesis work"
)
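A fixed sleep before every call also waits in full when the previous request itself took a while. One possible refinement, sketched here for a single-threaded caller, is to sleep only for the time still owed since the last request. The `Pacer` class and its injectable `clock` and `sleep` parameters are illustrative assumptions, not part of any SDK.

```python
import time


class Pacer:
    """Space calls at least 60 / rpm seconds apart, sleeping only for the time still owed."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = max(self.clock(), self.next_allowed)
        self.next_allowed = now + self.interval


# Usage: call pacer.wait() immediately before each API request.
pacer = Pacer(rpm=600)
for _ in range(3):
    pacer.wait()
    # send the request here
```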

Batch multiple prompts into a single request

If you’re reaching your RPM limit but still have capacity within your TPM limit, you can increase throughput by batching multiple prompts into each request. This method allows you to process more tokens per minute.

Sending a batch of prompts is similar to sending a single prompt, but you provide a list of strings for the prompt parameter instead of a single string. Note that the response objects may not return completions in the same order as the prompts, so be sure to match responses to prompts using the index field.

import openai

client = openai.OpenAI()

def send_batch_request(prompts):
    # An example batched request: `prompts` is a list of strings
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct-0914",
        prompt=prompts,
        max_tokens=100,
    )

    # Match completions to prompts by index. Pre-fill the list so that
    # each slot can be assigned by position.
    completions = [None] * len(prompts)
    for choice in response.choices:
        completions[choice.index] = choice.text
    return completions


num_prompts = 10
prompts = ["The small town had a secret that no one dared to speak of: "] * num_prompts

completions = send_batch_request(prompts)

# Print output
print(completions)
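When you have more prompts than fit comfortably in one request, you will need to decide on a batch size. The helpers below are a sketch under the assumption that every prompt costs roughly the same number of tokens (input plus max_tokens); both function names are hypothetical.

```python
def max_batch_size(tpm_limit, rpm_limit, tokens_per_prompt):
    """Largest number of prompts per request such that sending rpm_limit
    requests per minute still fits within tpm_limit."""
    return max(1, tpm_limit // (rpm_limit * tokens_per_prompt))


def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


size = max_batch_size(tpm_limit=90_000, rpm_limit=60, tokens_per_prompt=150)
print(size)  # 10

all_prompts = [f"prompt {i}" for i in range(25)]
print([len(batch) for batch in chunked(all_prompts, size)])  # [10, 10, 5]
```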

Closing Thoughts

Dealing with OpenAI rate limits can be tough.

Although they were created to prevent abuse, they can also obstruct legitimate uses of the API. By using tactics like exponential backoff, prompt optimization, prompt chaining, and caching, you can reasonably avoid hitting rate limits.

You can also improve your throughput by effectively using delays and batching requests. Of course, you can also increase your limits by upgrading your OpenAI or Azure tier.

FAQ: Managing rate limits across providers

1. What are rate limits, and why do they matter?

Rate limits cap how many requests or tokens you can send to a model within a certain timeframe. They prevent server overload, ensure fair use, and control infrastructure costs. For teams scaling apps, hitting these limits can cause delays, errors (429 responses), or downtime.

2. How do OpenAI rate limits work in 2025?

OpenAI enforces requests per minute (RPM) and tokens per minute (TPM).

  • Example: GPT-5 Tier 1 recently jumped to 500k TPM and 1,000 RPM. Higher tiers have even more headroom.
  • Token count includes both input and output tokens. If you set max_tokens too high, you might consume quota faster than expected.

3. How does Azure OpenAI handle quotas?

Azure uses per-region quotas. You’re allocated a TPM budget (e.g., 1,000,000 TPM in many regions for GPT-4.1). You divide that quota across deployments. RPM scales automatically with TPM. If you need more, you must request a quota increase via the Azure portal.

4. What about Anthropic?

Anthropic enforces limits by TPM + RPM, similar to OpenAI. As of 2025, Claude 3.5 Sonnet supports up to 1M TPM for enterprise accounts. For most developers, limits start lower and increase with usage and approval. Anthropic also offers prompt caching, which reduces repeated token costs across multi-turn chats.

5. What’s the best retry strategy if I hit a limit?

Use exponential backoff with jitter (random delay). Example: wait 1s, then 2s, then 4s, up to ~30s max. This prevents a “thundering herd” of retries. Libraries like backoff (Python) or built-in retry utilities help automate this.

6. How can I plan workloads to avoid rate limit pain?

  • Batch requests when you are hitting your RPM limit but still have TPM headroom.
  • Pre-chunk documents so one request doesn’t blow past your TPM.
  • Cache frequent queries (semantic caching is better than keyword caching).
  • Delay between requests (e.g., if 60 RPM, add ~1s pause).
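Pre-chunking, for example, can be sketched as splitting a document on paragraph boundaries against an approximate token budget. The 4-characters-per-token ratio is only a rough rule of thumb for English text; for exact counts you would use a real tokenizer. Note that this sketch leaves a single oversized paragraph as its own chunk rather than splitting it further.

```python
def split_by_token_budget(text, max_tokens, chars_per_token=4):
    """Split text on blank lines into chunks of roughly max_tokens tokens each."""
    budget = max_tokens * chars_per_token  # approximate character budget per chunk
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= budget or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


document = "\n\n".join(["a" * 40, "b" * 40, "c" * 40])
parts = split_by_token_budget(document, max_tokens=25)
print([len(part) for part in parts])  # [82, 40]
```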

7. Can I increase my limits?

  • OpenAI: usage tiers rise automatically with spend. Higher tiers = more TPM/RPM.
  • Azure: submit a quota increase request in the Azure portal.
  • Anthropic: enterprise contracts get priority for higher limits.

8. How does Vellum help with rate limit issues?

Vellum workflows let you:

  • Add retries with exponential backoff visually, no custom code.
  • Monitor token + request usage across models in one place.
  • Switch between providers when one hits limits, without rewriting your app.
    This means your team can ship faster without worrying which vendor’s limit is slowing you down.

9. What’s the risk of relying only on retries?

Retries help, but every failed attempt still counts toward your quota. If you retry too aggressively, you may burn through TPM faster. With Vellum, you can simulate workflows, test prompt size, and optimize before deploying so you waste fewer tokens.

10. Should I standardize on one provider or spread across many?

  • One provider = simpler, but you’re locked into their limits.
  • Multi-provider setup = flexibility. If OpenAI caps you, you can route some traffic to Anthropic or Azure.
    Vellum makes this approach easier — you can configure workflows to automatically fall back to another provider if one gets rate limited.
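If you want to prototype fallback routing in plain code, the core logic is small. This sketch is independent of Vellum and of any provider SDK: `RateLimited` is a hypothetical stand-in for each provider's 429 error, and the provider functions are stubs.

```python
class RateLimited(Exception):
    """Hypothetical stand-in for a provider-specific rate limit error."""


def call_with_fallback(prompt, providers):
    """Try each (name, call) pair in order, moving on when one is rate limited."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as error:
            last_error = error
    raise RuntimeError("all providers are rate limited") from last_error


def primary(prompt):
    raise RateLimited("primary is over its limit")  # simulate a 429


def secondary(prompt):
    return f"answer to: {prompt}"


print(call_with_fallback("hello", [("primary", primary), ("secondary", secondary)]))
# ('secondary', 'answer to: hello')
```

Keep in mind that different models may respond differently to the same prompt, so fallback paths are worth testing before you rely on them.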
ABOUT THE AUTHOR
Mathew Pregasen
Technical Contributor

Mathew Pregasen is a technical expert with experience with AI, infrastructure, security, and frontend frameworks. He contributes to multiple technical publications and is an alumnus of Columbia University and YCombinator.

LAST UPDATED
Sep 23, 2025