
3 Strategies to Reduce LLM Hallucinations

Methods and techniques to reduce hallucinations and maintain more reliable LLMs in production.


While language models are remarkably good at solving downstream tasks without task-specific supervision, they still have some practical challenges.

LLM hallucination is one of them, and a very important one.

When a language model hallucinates, it generates information that seems accurate but is actually false.

In this blog post we’ll cover three practical methods to reduce these hallucinations and maintain more reliable LLMs in production.

Ways To Reduce LLM Hallucinations

There are three practical ways to reduce LLM hallucinations:

  • Advanced prompting: when you want to rely on the model’s pre-trained knowledge;
  • Data augmentation: when you need additional context that doesn’t fit in the model’s context window;
  • Fine-tuning: when you have a standardized task and sufficient training data.

Each of these approaches has different techniques, and we’ll cover them in the next sections.

Advanced Prompting

You can resort to advanced prompting methods if your use case mostly relies on the model’s pre-trained knowledge, and you don’t need to use domain-specific knowledge.

These advanced prompting techniques guide the model to better understand the task at hand and the output that you’d like to get.

Instruct The Model To Avoid Adding False Information

A popular practice nowadays is to clearly instruct the model not to spread false or unverifiable information. This instruction is usually added in the "system prompt".

The following instruction, taken from the system prompt used for Llama 2-Chat, can be replicated and tested for your own use case.


💬 If you don’t know the answer to a question, please don’t share false information.
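As a rough illustration (not Meta’s official setup), here’s how that kind of instruction might be passed as a system message using the OpenAI Python client; the model name and the user question are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. "
                "If you don't know the answer to a question, "
                "please don't share false information."
            ),
        },
        # Placeholder user question to test how the model behaves
        {"role": "user", "content": "Who won the 2026 World Cup?"},
    ],
)
print(response.choices[0].message.content)
```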

Few Shot Prompting

Few-shot prompting reduces LLM hallucinations by providing a small number of specific examples to guide the model's responses.

This approach helps the model concentrate on the specific topic, making it easier for it to grasp the context and follow the format of the examples provided. However, its effectiveness depends on the quality of these examples; inaccurate or biased examples can lead to lower accuracy and sometimes more hallucinations.
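Here’s a minimal sketch of few-shot prompting for a hypothetical support assistant; the example pairs are invented, and in practice you’d use real, verified question/answer pairs:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Hypothetical few-shot examples: each user/assistant pair shows the model
# the tone, format, and level of caution we expect in its answers.
messages = [
    {
        "role": "system",
        "content": (
            "Answer support questions about our product. "
            "If the answer is not covered by the examples or your knowledge, say so."
        ),
    },
    {"role": "user", "content": "Does the Pro plan include SSO?"},
    {"role": "assistant", "content": "Yes, SSO is included on the Pro plan and above."},
    {"role": "user", "content": "Can I export my data to CSV?"},
    {"role": "assistant", "content": "Yes, you can export CSVs from Settings > Data."},
    # The real query goes last; the model follows the pattern set above.
    {"role": "user", "content": "Is there an on-premise deployment option?"},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```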

Chain Of Thought Prompting

Chain-of-thought prompting guides the LLM to generate reasoning steps before providing the final answer. You can simply instruct the LLM to “think step by step,” or you can give actual reasoning examples that you’d like your LLM to follow. To understand chain-of-thought better, read our guide.

However, chain-of-thought may introduce some new challenges. The potential for hallucinated reasoning is one of them.
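For a quick illustration, here’s one way to elicit step-by-step reasoning with a simple instruction; the question and the requested output format are invented for the example:

```python
from openai import OpenAI

client = OpenAI()

cot_prompt = (
    "A warehouse receives 3 pallets of 48 boxes each, and 17 boxes arrive damaged.\n"
    "How many usable boxes are there?\n\n"
    "Think step by step, then give the final answer on its own line as 'Answer: <number>'."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": cot_prompt}],
)

# The output should contain the intermediate reasoning followed by 'Answer: 127'.
print(response.choices[0].message.content)
```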

When you want to include additional context but it exceeds the model’s context window, you should use data augmentation techniques.

Data Augmentation

Data augmentation is the process of supplementing your model’s pre-trained knowledge with proprietary data or external tools and knowledge sources.

Below we show two options on how to augment your model’s responses and minimize hallucinations. Keep in mind that these methods are more complex to implement.

Retrieval-Augmented Generation

RAG is a specific technique where the model’s pre-trained knowledge is combined with a retrieval system of your proprietary data.

This system actively searches a vector database with stored information to find relevant data that can be used in the model's response. RAG can pull in and utilize proprietary data in real-time to improve the accuracy and relevance of its responses.
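As a simplified sketch of that retrieve-then-generate loop (Chroma is used here only as a stand-in vector store, and the documents and model name are placeholders), the flow might look like this:

```python
import chromadb
from openai import OpenAI

client = OpenAI()
chroma = chromadb.Client()

# Hypothetical document store: in practice these would be your proprietary docs,
# chunked and embedded ahead of time.
collection = chroma.create_collection("kb")
collection.add(
    ids=["doc-1", "doc-2"],
    documents=[
        "Refunds are processed within 5 business days of approval.",
        "Enterprise customers get a dedicated support channel on Slack.",
    ],
)

question = "How long do refunds take?"

# 1. Retrieve the most relevant chunks for the question.
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# 2. Ground the model's answer in the retrieved context.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "Answer using only the provided context. "
                "If the context doesn't contain the answer, say you don't know."
            ),
        },
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```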

Feel free to reach out if you’d like to incorporate this for your use case.

Use Of External Tools

Integrating tools with LLMs can also decrease hallucinations. Luckily, language models like GPT-4 are capable of stringing function calls together and using multiple tools in a chain: collecting data, planning, and executing the given task.

These tools can include database calls, API invocations, scripts that perform data processing, or even separate models for specific tasks (like sentiment analysis, translation, etc.) which will in turn improve the accuracy of the outputs.
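To make the idea concrete, here’s a minimal function-calling sketch with a single hypothetical get_order_status tool; in a real system the tool would hit your database or an external API instead of returning canned data:

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool: a real implementation would query your order database.
def get_order_status(order_id: str) -> str:
    return json.dumps({"order_id": order_id, "status": "shipped"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order A1234?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

# Assuming the model chose to call the tool, run it and send the result back
# so the final answer is grounded in real data rather than a guess.
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
messages.append(response.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": get_order_status(**args),
})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```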

Handling this process is not simple and requires a lot of testing and experimentation. Vellum’s AI tooling can help ease this process; if you’d like tailored advice on your use case, let us know.

Fine Tuning

Fine-tuning is considered to be one of the most effective ways to reduce hallucinations when you have a standardized task and sufficient training data.

To start with fine-tuning, you need to collect a large number of high-quality prompt/completion pairs, then experiment with different foundation models and various hyperparameters like learning rate and number of epochs until you find the best quality for your use case.
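As a rough sketch of what that data and job setup can look like with the OpenAI fine-tuning API (the file name, base model, and epoch count are placeholders you’d tune for your own use case):

```python
import json
from openai import OpenAI

# Hypothetical training examples in the chat fine-tuning format:
# one {"messages": [...]} object per line of the JSONL file.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the support ticket as billing, bug, or other."},
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
    ]},
    # ...in practice, hundreds or thousands of high-quality pairs.
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",   # placeholder base model
    hyperparameters={"n_epochs": 3},  # one of the knobs worth sweeping
)
print(job.id)
```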

To learn more about when to use fine-tuning and how to do it, read this detailed guide.

How To Evaluate These Strategies

Once you implement some of these methods, you need to evaluate whether they actually improve your outputs. To do this, you can work with human annotators, or you can use another LLM to evaluate the data for you. Many of the latest LLMs can evaluate your outputs about as well as human annotators, and 100x faster.
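A bare-bones LLM-as-judge sketch might look like this, with a placeholder judge model and an invented 0-5 grading rubric:

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, reference: str, candidate: str) -> str:
    """Ask a strong model to grade a candidate answer against a reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n\n"
        "Score the candidate from 0 (hallucinated or wrong) to 5 (fully correct and grounded). "
        "Reply with the score only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(judge("When was the company founded?", "In 2019.", "The company was founded in 2019."))
```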

However, even if you speed up the process with an LLM evaluator, building the whole workflow is still a complex process of its own. Below we share one proven strategy that works really well for our customers, and that you can try on your own or with Vellum.

Testing Strategy To Minimize Hallucinations

The goal of this strategy is to generate enough test cases to capture all of your edge cases, then select appropriate evaluation metrics and use the best model for the job.

Here’s a breakdown of this process:

Develop a Unit Test Bank

Create a set of test scenarios to evaluate the LLM's ability to handle various topics and avoid hallucinations.

The common understanding is that you’ll need historical data to create these test cases. While historical data is very useful, you can also use an LLM to create synthetic data for this purpose.
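For example, here’s a simple (hypothetical) way to ask a model for synthetic edge-case questions to seed your test bank; the domain and prompt wording are made up for illustration:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{
        "role": "user",
        "content": (
            "Generate 10 tricky customer-support questions for a SaaS billing product, "
            "including some whose answers are deliberately NOT in a typical knowledge base, "
            "so we can test whether the assistant admits it doesn't know. "
            "Return one question per line with no numbering."
        ),
    }],
)

# Split the response into individual test cases, one per line.
test_cases = [
    line.strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
print(test_cases)
```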

Select Appropriate Evaluation Metrics

Now, depending on your LLM task, you can use different evaluation metrics.

Here are two sets of metrics and their applications:

1. Semantic similarity + relevance metrics

Imagine you're using an LLM to generate responses to customer queries. After feeding a query to the LLM, it provides a response. To evaluate this response, you would use semantic similarity and relevance metrics to compare the LLM's response with a pre-existing, correct response to the same query.
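One lightweight way to compute semantic similarity is cosine similarity over embeddings; this sketch uses an OpenAI embedding model as a placeholder, but any embedding model would work:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def semantic_similarity(reference: str, candidate: str) -> float:
    """Cosine similarity between embeddings of the reference and candidate answers."""
    result = client.embeddings.create(
        model="text-embedding-3-small",  # placeholder embedding model
        input=[reference, candidate],
    )
    a, b = (np.array(item.embedding) for item in result.data)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(semantic_similarity(
    "You can reset your password from the account settings page.",
    "Go to account settings to reset your password.",
))
```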

2. Relevance, helpfulness, and authority metrics

These metrics are typically used in contexts where it's crucial to evaluate the quality and reliability of information provided, especially when dealing with factual data, advice, or expert opinions. For instance, consider a scenario where an LLM is used to provide financial advice or health information. In such cases, it's not just important for the LLM's responses to be semantically similar to known correct responses, but they also need to be relevant, helpful and credible.

If you want to read more on how to evaluate your models in production, check this guide.

Conclusion

Now that you're aware of the various methods for minimizing LLM hallucinations, it's important to remember that the right technique for your task depends on a few key factors.

You should consider your project objectives, the data available to you, why LLM hallucinations happen in the first place, and whether your team is capable of developing and evaluating these techniques. You can also combine several methods in your setup, like fine-tuning plus external tools, for better results.

To really make sure these methods are minimizing hallucinations, you should build a workflow to evaluate them. This is key to making sure your chosen method is truly improving your LLM's performance and reliability.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

Last updated: Jan 3, 2024