Search...

Index

Inline evaluation / Guardrails: Ensure good system performance at run-time

This is some text inside of a div block.

Chain of Thought Prompting (CoT): Everything you need to know

Chain of Thought prompting, when it's the right choice, and how it stacks up against other techniques.

Author

Anita Kirkovska

Sep 11, 2024

Large Language models (LLMs) are good at predicting the next word. But, they find it hard to solve problems that need step-by-step thinking.

Enter prompt engineering.

With the right prompts, you can guide a language model like GPT-4o or Claude 3.5 Sonnet to give better answers.

There's a whole toolkit of techniques to craft a good prompt, but when it comes to complex reasoning tasks, Chain of Thought prompting stands out as a solid option.

In this blog, we'll explore everything there is to know about Chain of Thought prompting, when it's the right choice, and how it stacks up against other techniques.

‍

What is Chain-of-Thought Prompting?

Chain-of-Thought (CoT) prompting is a technique that guides LLMs to follow a reasoning process when dealing with hard problems. This is done by showing the model a few examples where the step-by-step reasoning is clearly laid out. The model is then expected to follow that "chain of thought" reasoning and get to the correct answer.

Standard prompting vs Chain of Thought prompting

This technique is highly effective because it breaks down complex problems into more manageable parts. The approach allows models to focus on solving each part step-by-step, which boosts their accuracy.

Given its success with complex tasks, newer models like OpenAI o1 now embed this approach natively, making them even better at handling challenging problems, but require different set of prompting techniques.

‍

Standard CoT example

Solving a Linear equation

LLM models are getting better at math, but it’s still one of their biggest challenges. When you use a LLM to solve for math problem, it helps to guide the model through the steps it should take before reaching the final answer. Check the example below:

System message:

You’re the best mathematician in the world and you’ll help me solve linear equations.

Example:

For the equation 5x - 4 = 16

1. Add 4 to both sides: 5x - 4 + 4 = 16 + 4 → 5x = 20
2. Divide both sides by 5: 5x / 5 = 20 / 5 → x = 4

User Message:

Now solve 2x + 3 = 7

And we get this output from GPT-4o:


1. Subtract 3 from both sides:

   2x + 3 - 3 = 7 - 3 → 2x = 4

2. Divide both sides by 2:

   2x / 2 = 4 / 2< → x = 2

The model followed the intermediate steps and arrived to the correct answer. Usually, you might want the model to think through steps but only display the final result.

Now, usually, in your use-case you might want to let the model think (follow a step-by-step process) but you only want to surface the final output.

You can do this by either chaining your prompts or using techniques like placing the thinking process in a separate XML tag (e.g., <thinking>) and the answer in another (e.g., <answer>). Afterward, you can apply data transformation to filter and display only the answer to the end user. You can read about some of these techniques here.

This might look very similar to few-shot prompting, but there is a significant difference.

‍

Difference between Few-Shot prompting and Chain-of-Thought?

Few-shot prompting is when you give a few examples so the language model can understand want it should do. So the previous examples will not go into the intermediate steps.
The math example will look more like: "For the equation 5x - 4 = 16, The result is: x = 4."

On the other hand, Chain-of-Thought prompting is about showing the step-by-step thinking from start to finish, which helps with “reasoning” and getting more detailed answers.

Bottom line: It's about showing the work, not just the answer.

‍

When should you use Chain-of-Thought prompting?

CoT is ideal when your task involves complex reasoning that require arithmetic, commonsense, and symbolic reasoning; where the model needs to understand and follow intermediate steps to arrive at the correct answer. Just look at the benchmarking report in the image below that Claude release a few months ago. For all benchmarks that evaluate for reasoning (GPQA,MMLU,DROP, Big Bench) they use 3-shot or 5-shot CoT prompting!

In terms of model sizes, this technique works really well with bigger models (>100 billion parameters); think PaLM, and GPT-4o.

On the flip side, smaller models have shown some issues, creating odd thought chains and being less precise compared to standard prompting.

In other specific cases, you don’t even need to show the intermediate steps; you can just use Zero-Shot CoT prompting.

‍

What is Zero-Shot Chain-of-Thought prompting?

Zero-shot chain-of-thought (Zero-Shot-CoT) prompting involves adding "Let's think step by step" to the original prompt to guide the language model's reasoning process. This approach is particularly useful when you don't have many examples to use in the prompt.

Let's say you're trying to teach the AI about a new concept, like "quantum physics," and you want it to generate some explanations. Instead of just saying, "Explain quantum physics," you can just say "Let's think step by step: Explain quantum physics."

That’s it.

By including the "Let's think step by step" part, you help the AI break down complex topics into manageable pieces.

And you can do this on auto-pilot.

‍

Automatic chain of thought (Auto-CoT)

Automatic Chain of Thought or Auto-CoT automatically generates the intermediate reasoning steps by utilizing a database of diverse questions grouped into clusters.

Auto-CoT goes through two main stages:

Question Clustering: First, they partition questions of a given dataset into a few clusters. So, if people asked the computer program a bunch of questions about baking, like "How do I mix the ingredients?" and "What temperature should I bake a pie at?" these would go into different groups.
Demonstration Sampling: Once they have these question groups, they pick one question from each group and use Zero-Shot CoT prompt (basically the “Let’s think step by step” prompt). This way, the computer program generates clear and straightforward instructions on auto-pilot.

The process is illustrated below:

Outline of the Automated Chain of Thought process

‍

What's considered complex reasoning for LLMs today?

If we ask GPT-4o today to solve for x in the equation (64 = 2 + 5x + 32), it will solve it without any examples given. This may look like a simple math problem, but at the beginning of 2023 this was a very challenging problem even for GPT-4.

These days, it seems like the model automatically provides step-by-step answers to most reasoning questions by default. Go ahead, try it!

Now, just think about how much smarter an LLM can become when you provide it with a step-by-step guide to optimize your code, restructure your databases, or develop a game strategy for popular games like "Minecraft.”

And imagine how powerful this technique can be when scientists teach an AI to follow detailed step-by-step diagnosis for complex medical conditions.

The possibilities are endless, and that’s where these techniques come in handy, especially when we introduce the “visual” element to the mix.

‍

Multimodal Chain-of-Thought prompting

Multimodal Chain-of-Thought prompting uses both words and pictures to showcase the reasoning steps, to help guide the LLM to showcase its “reasoning”, and the right answer.

And if you were following the latest AI news, multi-modality is coming to an LLM near you.

ChatGPT can now see and talk utilizing GPT-4V(ision); and it can help you fix a bike seat if you share a picture of your bike, the manual, and your tools.

Well, with MultiModal Chain-of-Thought prompting you can lay out the reasoning tasks, share the photos upfront and get to the answer right away.

But, what are the limits to CoT prompting?

The biggest limit is that there is no guarantee of correct reasoning paths, and since we don’t really know if the model is really “reasoning” with us, this can lead to both correct and incorrect answers.

There are other prompt techniques like Self-Consistency which incorporate different “reasoning examples” for a single task and Tree of Thoughts (ToT) that has like a map of possible paths, and self-calibrates if it goes towards the wrong path. Apart from this prompting technique, you can follow some best practices on how to prompt these models - we've outlined all on this link.

‍

How to make the most of your CoT prompts?

No matter the prompt engineering technique you pick for your project, it's important to experiment, test, and understand what your end users think.

With Chain of Thought (CoT) prompting, it tends to do better with bigger models and tricky reasoning tasks. If you're making an app and this sounds like what you need, we can help.

Vellum.ai gives you the tools to try out different Chain of Thought prompts and models, check how good they are, and tweak them easily once they're in production — no custom code needed! Request to talk with our AI experts if you have any questions!

Large Language models (LLMs) are good at predicting the next word. But, they find it hard to solve problems that need step-by-step thinking.

Enter prompt engineering.

With the right prompts, you can guide a language model like GPT-4o or Claude 3.5 Sonnet to give better answers.

There's a whole toolkit of techniques to craft a good prompt, but when it comes to complex reasoning tasks, Chain of Thought prompting stands out as a solid option.

In this blog, we'll explore everything there is to know about Chain of Thought prompting, when it's the right choice, and how it stacks up against other techniques.

‍

What is Chain-of-Thought Prompting?

‍

Standard CoT example

Solving a Linear equation

System message:

You’re the best mathematician in the world and you’ll help me solve linear equations.

Example:

For the equation 5x - 4 = 16

1. Add 4 to both sides: 5x - 4 + 4 = 16 + 4 → 5x = 20
2. Divide both sides by 5: 5x / 5 = 20 / 5 → x = 4

User Message:

Now solve 2x + 3 = 7

And we get this output from GPT-4o:


1. Subtract 3 from both sides:

   2x + 3 - 3 = 7 - 3 → 2x = 4

2. Divide both sides by 2:

   2x / 2 = 4 / 2< → x = 2

The model followed the intermediate steps and arrived to the correct answer. Usually, you might want the model to think through steps but only display the final result.

Now, usually, in your use-case you might want to let the model think (follow a step-by-step process) but you only want to surface the final output.

This might look very similar to few-shot prompting, but there is a significant difference.

‍

Difference between Few-Shot prompting and Chain-of-Thought?

On the other hand, Chain-of-Thought prompting is about showing the step-by-step thinking from start to finish, which helps with “reasoning” and getting more detailed answers.

Bottom line: It's about showing the work, not just the answer.

‍

When should you use Chain-of-Thought prompting?

In terms of model sizes, this technique works really well with bigger models (>100 billion parameters); think PaLM, and GPT-4o.

On the flip side, smaller models have shown some issues, creating odd thought chains and being less precise compared to standard prompting.

In other specific cases, you don’t even need to show the intermediate steps; you can just use Zero-Shot CoT prompting.

‍

What is Zero-Shot Chain-of-Thought prompting?

That’s it.

By including the "Let's think step by step" part, you help the AI break down complex topics into manageable pieces.

And you can do this on auto-pilot.

‍

Automatic chain of thought (Auto-CoT)

Automatic Chain of Thought or Auto-CoT automatically generates the intermediate reasoning steps by utilizing a database of diverse questions grouped into clusters.

Auto-CoT goes through two main stages:

Question Clustering: First, they partition questions of a given dataset into a few clusters. So, if people asked the computer program a bunch of questions about baking, like "How do I mix the ingredients?" and "What temperature should I bake a pie at?" these would go into different groups.
Demonstration Sampling: Once they have these question groups, they pick one question from each group and use Zero-Shot CoT prompt (basically the “Let’s think step by step” prompt). This way, the computer program generates clear and straightforward instructions on auto-pilot.

The process is illustrated below:

‍

What's considered complex reasoning for LLMs today?

These days, it seems like the model automatically provides step-by-step answers to most reasoning questions by default. Go ahead, try it!

And imagine how powerful this technique can be when scientists teach an AI to follow detailed step-by-step diagnosis for complex medical conditions.

The possibilities are endless, and that’s where these techniques come in handy, especially when we introduce the “visual” element to the mix.

‍

Multimodal Chain-of-Thought prompting

Multimodal Chain-of-Thought prompting uses both words and pictures to showcase the reasoning steps, to help guide the LLM to showcase its “reasoning”, and the right answer.

And if you were following the latest AI news, multi-modality is coming to an LLM near you.

ChatGPT can now see and talk utilizing GPT-4V(ision); and it can help you fix a bike seat if you share a picture of your bike, the manual, and your tools.

Well, with MultiModal Chain-of-Thought prompting you can lay out the reasoning tasks, share the photos upfront and get to the answer right away.

But, what are the limits to CoT prompting?

‍

How to make the most of your CoT prompts?

No matter the prompt engineering technique you pick for your project, it's important to experiment, test, and understand what your end users think.

With Chain of Thought (CoT) prompting, it tends to do better with bigger models and tricky reasoning tasks. If you're making an app and this sounds like what you need, we can help.

ABOUT THE AUTHOR

Anita Kirkovska

Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

No items found.

talk with an AI Expert

LLM basics

June 8, 2025

•

5 min

Big Ideas from the AI Engineer World’s Fair

LLM basics

June 1, 2025

•

8 min

Build AI Products Faster: Top Development Platforms Compared

Customer Stories

May 30, 2025

•

5 min

How GravityStack Cut Credit Agreement Review Time by 200% with Agentic AI

Guides

May 28, 2025

•

7 min

How the Best Product and Engineering Teams Ship AI Solutions

Model Comparisons

May 23, 2025

•

8 min

Evaluation: Claude 4 Sonnet vs OpenAI o4-mini vs Gemini 2.5 Pro

Guides

May 16, 2025

•

7 min

Document Data Extraction in 2025: LLMs vs OCRs

The Best AI Tips — Direct To Your Inbox

Latest AI news, tips, and techniques

Specific tips for Your AI use cases

No spam

Oops! Something went wrong while submitting the form.

Each issue is packed with valuable resources, tools, and insights that help us stay ahead in AI development. We've discovered strategies and frameworks that boosted our efficiency by 30%, making it a must-read for anyone in the field.

Marina Trajkovska

Head of Engineering

This is just a great newsletter. The content is so helpful, even when I’m busy I read them.

Jeremy Hicks

Solutions Architect

Experiment, Evaluate, Deploy, Repeat.

AI development doesn’t end once you've defined your system. Learn how Vellum helps you manage the entire AI development lifecycle.

Book a DemoLearn more