
Understanding Logprobs: What They Are and How to Use Them

Learn what OpenAI's logprobs are and how you can use them in your LLM applications


LLMs are like smart text predictors. For every word or phrase they generate, they consider several possible next words and decide how likely each one is.

For example, if the model is trying to complete the sentence: “The best movie of all time is…” it might consider options like “The Godfather” or “Citizen Kane.” However, a choice like “Cats” would likely get a very low probability, close to 0%—not to judge, but the visual effects in that one were pretty rough!

When working with model outputs, particularly in machine learning and natural language processing, we often deal with probabilities, which indicate how likely an event (like predicting a word or a label) is to happen.

However, instead of using the actual probability percentages directly (like 10%), we use the logarithm of these probabilities. This is called the “log probability” or “logprob.”

For example, a logprob of “-1” corresponds to a probability of about 37% (OpenAI’s logprobs use the natural logarithm, so the probability is e^-1 ≈ 0.37), but the log form is easier to work with in calculations. The more negative the logprob, the lower the probability: a logprob of “-3” corresponds to only about 5%.
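Here’s a minimal sketch of the conversion in plain Python, assuming natural-log logprobs like the ones OpenAI returns:

import math

# Convert a logprob back into a regular probability
for logprob in (-0.05, -1.0, -3.0):
    print(f"logprob {logprob:>5} -> probability {math.exp(logprob):.1%}")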

The GPT playground showing the log probabilities

Why Use Logprobs in Machine Learning?

We use logprobs because they make probability calculations faster and easier for computers to work with.

  1. It’s cheaper for computers to do addition than multiplication. Scoring a sequence of tokens is easier when you add the log probabilities of each token instead of multiplying their raw probabilities (see the sketch after this list).
  2. Optimizing the log probability (logprob) is more effective than optimizing the probability itself: the gradient (the direction and rate of change) of the logprob tends to be smoother and more stable, which makes it easier to optimize during training.
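To make the first point concrete, here’s a small sketch (plain Python, illustrative numbers only) showing that summing logprobs recovers the same sequence probability as multiplying the raw probabilities, while staying in a range that’s friendlier for long sequences:

import math

# Hypothetical per-token probabilities for a short sequence
token_probs = [0.9, 0.05, 0.6, 0.3]

# Multiplying raw probabilities quickly produces tiny numbers
# (and can underflow to 0.0 for long sequences)
product = math.prod(token_probs)

# Adding logprobs gives the same value in log space
log_sum = sum(math.log(p) for p in token_probs)

print(product)            # ~0.0081
print(math.exp(log_sum))  # ~0.0081 (same value, recovered from the log-space sum)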

While primarily used by researchers to evaluate model performance, some providers like OpenAI are now offering this feature in their API, allowing users to adjust this parameter in their own LLM systems.

How Does OpenAI’s Logprobs Parameter Work?

OpenAI introduced the logprobs parameter in their API in 2023.

When logprobs is enabled, the API returns the log probabilities for each output token, along with a few of the most likely tokens and their log probabilities at each position. Here are the key parameters:

  • logprobs: If set to true, the API returns log probabilities for each token in the output. Note that this option isn’t available for the gpt-4-vision-preview model.
  • top_logprobs: A number between 0 and 5 that specifies how many of the most likely tokens to return at each position, along with their log probabilities. The logprobs parameter must be true to use this option.

The higher the log probability, the more likely the token is the correct choice in that context. This makes it easy to see how "confident" the model is in its output, and to check the other candidate tokens it considered.
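Here’s a minimal sketch of a request with logprobs enabled, using the OpenAI Python SDK (the model name and prompt are placeholders):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo",  # placeholder; any chat model that supports logprobs
    messages=[{"role": "user", "content": "The best movie of all time is"}],
    max_tokens=5,
    logprobs=True,
    top_logprobs=3,
)

# Each output token comes with its logprob and the top alternatives considered
for token_info in response.choices[0].logprobs.content:
    print(token_info.token, token_info.logprob)
    for alt in token_info.top_logprobs:
        print("  alternative:", alt.token, alt.logprob)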

Learn more about it here.

But what can you do with it?

What Can You Do with OpenAI’s Logprob Parameter?

You can leverage OpenAI’s logprobs to improve your LLM system in several ways, especially for tasks like classification, autocomplete, retrieval evaluation, and minimizing hallucinations. You can also use logprobs in production as a moderation tool.

Let’s look at some examples of how our customers use it when developing their AI features:

Evaluating Classification

Sometimes, we use LLMs to classify content. By default, the models pick the token with the highest probability. However, we can use the logprobs parameter to check if the model’s response meets a specific logprob threshold.

Let’s say that we want to classify a user query into one of three categories: “Product Info”, “Pricing”, and “Need to Talk with an Agent”.

Let’s say that this is our prompt:

CLASSIFICATION_PROMPT = """
You will be given a user query.
Classify the query into one of the following categories: Product Info, Pricing, or Need to Talk with an Agent.
Return only the name of the category, and nothing else.
Ensure your output is one of the three categories mentioned.
User query: {query}
"""

And let’s say that these are some of our user questions:

user_queries = [
    "Can you provide details on the new smartphone model’s photo-editing features?",
    "What is the cost of the new smartphone model with advanced photo-editing features?",
    "How can I track my delivery status in real-time?",
]

If we run these queries through our LLM, it’s obvious that the first one falls under “Product Info”, the second under “Pricing”, and the third could arguably be labeled “Product Info”, but with lower probability.

It’s easy for us to identify these simple examples just by looking at them — but if we want to scale this approach, we can adopt the logprobs parameter in the API to check whether a given classification satisfies a specific “threshold”.

Let’s see how GPT-4 Turbo classifies these queries.

For example, if the model classifies the query "How can I track my delivery status in real-time?" with a probability below our threshold, we can automatically route it to our "Need to Talk with an Agent" branch, or expand our categories to include options like "Delivery and Tracking" for more accurate classification.
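Here’s a rough sketch of how that threshold check might look, reusing the CLASSIFICATION_PROMPT above (the 0.99 threshold and the fallback category are assumptions you’d tune for your own system):

import math
from openai import OpenAI

client = OpenAI()

def classify_with_confidence(query: str, threshold: float = 0.99) -> str:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": CLASSIFICATION_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
        logprobs=True,
    )
    choice = response.choices[0]
    label = choice.message.content.strip()

    # Sum the per-token logprobs to get the probability of the full label
    total_logprob = sum(t.logprob for t in choice.logprobs.content)
    confidence = math.exp(total_logprob)

    # Route low-confidence classifications to a human instead of guessing
    if confidence < threshold:
        return "Need to Talk with an Agent"
    return label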

Detecting RAG Hallucinations

In our RAG-based systems, we usually pull context dynamically into our prompts to reduce hallucinations and ground the model in our own knowledge. But even with this context, the model can hallucinate if the answer isn’t in the retrieved documents.

This is because these models are built to always give an answer, even when they don’t have the right one.

You can use logprobs as a filter to evaluate retrieval accuracy. By setting a threshold, you ensure that only responses whose probability (derived from the logprob) is close to 100% are treated as reliable. If it’s lower, it’s a signal that the answer may not be supported by the retrieved documents.
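A minimal sketch of such a filter, assuming you already have the response object from a logprobs-enabled call (the 0.9 threshold and the escalate_to_human helper are hypothetical):

import math

def answer_confidence(logprob_content) -> float:
    """Average per-token probability of a completion, derived from its logprobs."""
    avg_logprob = sum(t.logprob for t in logprob_content) / len(logprob_content)
    return math.exp(avg_logprob)

# if answer_confidence(response.choices[0].logprobs.content) < 0.9:
#     escalate_to_human(response)  # hypothetical fallback for low-confidence answers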

Building an Autocomplete Engine

You can use logprobs to improve autocomplete suggestions as a user is typing. By setting a high confidence threshold, you can ensure that only the most likely and accurate suggestions are shown, avoiding less certain or irrelevant options.

This makes the autocomplete experience more reliable and helpful.
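For example, here’s a small sketch that keeps only the suggestions the model is reasonably confident about, assuming you pass in the top_logprobs entries for the next token (the 0.3 cutoff is an arbitrary example):

import math

def filter_suggestions(top_logprobs, min_prob: float = 0.3) -> list[str]:
    # Keep only the candidate next tokens whose probability clears the cutoff
    return [alt.token for alt in top_logprobs if math.exp(alt.logprob) >= min_prob]

# first_position = response.choices[0].logprobs.content[0].top_logprobs
# suggestions = filter_suggestions(first_position)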

Moderation Filters

Logprobs can also help us screen responses to avoid rude, offensive, or harmful content. By building an LLM evaluator, we can classify incoming queries and block the ones it flags as meeting negative criteria with near-100% confidence.

Token Healing

LLMs use tokens to process and generate text, which can sometimes lead to issues with how prompts are handled.

For example, if the model is unsure how to finish a given URL in a prompt, logprobs reveal which tokens it thinks are likely, helping you tweak the prompt to get better results.

Here’s a simple example:

If your prompt is The link is <a href="http:, and the model struggles, logprobs can show which completions it’s considering. If the logprobs suggest the model isn’t sure about finishing the URL, you might adjust the prompt to The link is <a href="http, which could make it more likely to generate a complete URL correctly.

Why is this the case?

When you end a prompt with “http:”, the model might not complete it correctly because it sees “http:” as a separate token and doesn’t automatically know that “://” should come next. But if you end the prompt with just “http”, the model generates URLs as expected because it doesn’t encounter the confusing token split.
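If you want to see this for yourself, here’s a rough sketch using the legacy Completions endpoint (shown because it accepts a raw text prompt; the model name is a placeholder):

from openai import OpenAI

client = OpenAI()

def peek_next_tokens(prompt: str):
    """Return the top candidate first tokens the model considers for a prompt."""
    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=1,
        logprobs=5,
    )
    return response.choices[0].logprobs.top_logprobs[0]

# Compare the candidate continuations for the two prompt endings
print(peek_next_tokens('The link is <a href="http:'))
print(peek_next_tokens('The link is <a href="http'))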

Using Logprobs to Improve LLM Features

Logprobs are handy during prototyping for spotting issues like hallucinations and capturing problematic token generation. They can also help build a solid classifier. In production, logprobs serve as a moderation tool, allowing you to easily isolate and address problematic prompts.

Using the logprob parameter can streamline your work and provide better structure than just tweaking prompts. Many of our customers are finding it valuable for building more reliable systems right from the start.

A good AI development platform can help with this experimentation.

With Vellum’s Prompt Playground, you can easily adjust your prompts and compare how different scenarios play out, whether you use the logprobs feature or not.

If you’re interested in comparing multiple prompts, with or without the logprob parameter — get in contact here.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

Last updated: Sep 3, 2024