#INSIGHTS
From benchmarks to practice: What does GPT-5 unlock for your agents?

GPT-5 opens up a lot of new possibilities for building AI agents, but it also changes the rules for how you prompt, test, and deploy them.
Benchmarks alone won’t tell you how the model behaves in the wild, and prompting tips only matter if they hold up under real workloads.
This playbook brings both sides together. We’ll share the latest reported benchmarks from model provider reports and independently run evals, covering reasoning, latency, and cost, and pair them with prompting patterns that have actually worked for building agents.
The goal isn’t to overwhelm you with theory, but to give you practical tools you can apply right away — whether you’re experimenting with a side project or running production workflows.
#REASONING
Reasoning at the Top End: How big of a reasoner is GPT-5?
In the GPQA Diamond benchmark, one of the hardest reasoning tests available, GPT-5 scores near the top, essentially tied with Grok 4 and ahead of Gemini 2.5 Pro and earlier OpenAI models.

GPQA Diamond benchmark: Top models
Grok 4: 87.5%
GPT-5: 87.3%
Gemini 2.5 Pro: 86.4%
Grok 3 [Beta]: 84.6%
OpenAI o3: 83.3%
DeepSeek R1: 83.3%
What this tells us: raw accuracy isn’t the whole story, but GPT-5 has the headroom you need if your agents are expected to chain steps together, weigh trade-offs, or handle complex prompts without drifting.
For builders, it’s a strong signal that GPT-5 can serve as the “reasoning engine” in multi-model or agent workflows.

#MATH
Math matters: How good is GPT-5 at math reasoning?
Models that excel at math benchmarks aren’t just good at crunching numbers, they tend to generalize that precision into better step-by-step reasoning.
Tasks like high school competition math (AIME 2025) force a model to break problems down, hold intermediate results in memory, and stay consistent across multiple steps.
Those same skills transfer to coding, structured problem-solving, and agent workflows.
AIME 2025 benchmark: Top 5 models
GPT-5: 99.6%
GPT oss 20b: 98.7%
OpenAI o3: 98.4%
GPT oss 120b: 97.9%
Grok 3 [Beta]: 93.3%
The chart shows GPT-5 and other OpenAI models owning this math benchmark, clearing the 95–100% range. That edge suggests a deeper capacity for logical reasoning compared to peers.
For builders, it’s a sign that if your agents rely on multi-step logic, whether it’s debugging, planning, or decision-making, OpenAI’s models should provide the most reliable foundation right now.

#GENERAL BENCHMARKS
The Hardest Test We Have: Humanity’s Last Exam
Humanity’s Last Exam is designed to push models to their limits. It mixes the toughest reasoning, knowledge, and problem-solving challenges we can throw at them.
Unlike narrow tests, it captures how well a model can generalize across many domains.
Humanity’s Last Exam benchmark
GPT-5: 35.2%
Grok 4: 25.4%
Gemini 2.5 Pro: 21.6%
OpenAI o3: 20.32%
GPT oss 120b: 14.9%
The results? GPT-5 stands well above the rest, scoring over 35%, while Grok 4, Gemini 2.5 Pro, and others plateau at roughly 25% or below. That margin isn’t incremental; it’s a sign of a model that can generalize and hold up under the toughest challenges we can throw at it.
For builders, this means more trust in outputs, fewer edge-case breakdowns, and a stronger foundation for agents that need to reason in the wild.
#CODING
Developers are AI's super users: How good is GPT-5 for code generation?
Most coding benchmarks measure autocomplete skills. SWE Bench is different: it asks models to act like real engineers, reading instructions, planning, fixing bugs, and making decisions along the way.
On this benchmark, Grok 4 and GPT-5 perform almost identically. Claude Opus 4.1, Claude 4 Sonnet, and Claude 4 Opus follow closely, all hovering in the 72–75% range. For context, earlier versions of these models scored below 50%, so this represents a jump of more than 20 percentage points.
SWE-bench: Top 5 models for agentic coding
Grok 4: 75%
GPT-5: 74.9%
Claude Opus 4.1: 74.5%
Claude 4 Sonnet: 72.7%
Claude 4 Opus: 72.5%
The gap is razor-thin, but that’s the story: multiple frontier models can now reliably act as coding agents. For builders, that means fewer roadblocks and more confidence when handing off real engineering tasks to AI.
#COST
High discount season: How much cheaper is GPT-5?
Model quality is rising fast, but the real story is in the price tags. Just a year ago, running frontier models at scale felt like a luxury reserved for a few.
Now it’s starting to look like a clearance sale. With just $20, you could process about 16 million GPT-5 input tokens, 2 million GPT-5 output tokens, or a balanced mix that powers hours of serious coding or analysis.
Models are getting cheaper (API price per 1M tokens)
GPT-5: $1.25 input / $10 output
Claude 3.5 Sonnet: $3 input / $15 output
Gemini 2.5 Pro: $2.50 input / $15 output
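To make the takeaway concrete, here is a minimal sketch (in Python) of the arithmetic behind the “$20 budget” framing above, using the per-1M-token prices from the chart; the model keys and the helper function are just illustrative names for this example.

```python
# Per-1M-token API prices from the chart above (USD); treat as a snapshot, since pricing changes.
PRICES = {
    "gpt-5":             {"input": 1.25, "output": 10.00},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
    "gemini-2.5-pro":    {"input": 2.50, "output": 15.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a workload from its token counts."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# $20 buys roughly 16M GPT-5 input tokens ($20 / $1.25 per 1M)
# or roughly 2M GPT-5 output tokens ($20 / $10 per 1M).
print(cost_usd("gpt-5", input_tokens=16_000_000, output_tokens=0))  # 20.0
print(cost_usd("gpt-5", input_tokens=0, output_tokens=2_000_000))   # 20.0
```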
The takeaway: it’s no longer just about which model is smartest. It’s about how cheap intelligence has become, and how much more room that gives teams to experiment, deploy, and scale.
#PROMPTS
Prompting is changing: What kind of prompts get the best results from GPT-5?
GPT-5 performs best when the foundation of the prompt is clear and minimal. And while all of the best practices for prompt engineering still apply, there are some unique things to keep in mind for this model.
This model is highly adaptable: you can tune it between deeper and lighter reasoning by adjusting API parameters such as reasoning_effort, which controls how long it spends thinking before replying.
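As a rough illustration, here is a minimal sketch of toggling that behavior with the OpenAI Python SDK's Responses API; exact parameter names and accepted effort values can vary by SDK version, so check the current docs before relying on it.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Fast, low-latency call: spend as little time reasoning as possible.
quick = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},
    input="Summarize this support ticket in one sentence: ...",
)

# Harder, multi-step task: let the model think longer before replying.
deep = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},
    input="Plan the migration steps for moving our billing service to the new API.",
)

print(quick.output_text)
print(deep.output_text)
```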
In your prompt instructions, you should avoid conflicts, structure instructions hierarchically, and always include an “escape hatch” so the model knows how to act under uncertainty. Too many rules create drift; a lean framework keeps outputs reliable.
We wrote a whole guide on how to prompt GPT-5, but here’s a breakdown of what a strong “proactive” agent prompt looks like:

The anatomy of your GPT-5 prompt for agents
<global_instructions>
You are CareFlow Assistant, a virtual admin for a healthcare startup that schedules patients based on priority and symptoms. Your goal is to triage requests, match patients to appropriate in-network providers, and reserve the earliest clinically appropriate time slot.
Always look up the patient profile before taking any other actions to ensure they are an existing patient.
1/ Core entities include…
2/ Use the following capabilities: schedule-appointment, modify-appointment, waitlist-add, find-provider, lookup-patient and notify-patient…
3/ For high-acuity Red and Orange cases, auto-assign the earliest same-day slot …
4/ For high-acuity Red and Orange cases, auto-assign the earliest same-day slot after informing the patient of your actions…
</global_instructions>
<persistence>
1/ You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user.
2/ Only terminate your turn when you are sure that the problem is solved.
3/ Never stop or hand back to the user when you encounter uncertainty — research or deduce the most reasonable approach and continue.
4/ Do not ask the human to confirm or clarify assumptions, as you can always adjust later — decide what the most reasonable assumption is, proceed with it, and document it for the user's reference after you finish acting
</persistence>
<tool_preambles>
1/ Always begin by rephrasing the user's goal in a friendly, clear, and concise manner, before calling any tools.
2/ Then, immediately outline a structured plan detailing each logical step you’ll follow.
3/ As you execute your file edit(s), narrate each step succinctly and sequentially, marking progress clearly.
4/ Finish by summarizing completed work distinctly from your upfront plan.
</tool_preambles>
Main instructions, kept simple
GPT-5 performs best when the foundation of the prompt is clear and minimal. Avoid conflicts, structure instructions hierarchically, and always include an “escape hatch” so the model knows how to act under uncertainty. Too many rules create drift; a lean framework keeps outputs reliable.
Increasing model proactiveness
By setting reasoning_effort higher, you can push GPT-5 to lean in on tougher problems instead of handing them back. Reinforcing this with a persistence block in your prompt (like the one above) increases persistence and problem-solving depth, especially useful for agent-style workflows where you want fewer interruptions and more complete outcomes.
Control over tool execution
Unlike earlier models, GPT-5 explains its reasoning when calling tools. If your prompt sets the right expectations, the model will narrate why it’s making those calls, making debugging and adjustments far easier. For builders, this means more visibility into decision-making and a tighter feedback loop at runtime.
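Here is a minimal sketch of what that looks like in code: a single hypothetical lookup_patient tool, mirroring the CareFlow example above, wired up through the OpenAI Python SDK with a one-line preamble instruction. The tool schema and prompt text are illustrative, and the tool-definition format may differ slightly across SDK versions.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool, mirroring the CareFlow example above.
tools = [{
    "type": "function",
    "name": "lookup_patient",
    "description": "Look up an existing patient profile by name and date of birth.",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "date_of_birth": {"type": "string", "description": "YYYY-MM-DD"},
        },
        "required": ["name", "date_of_birth"],
    },
}]

response = client.responses.create(
    model="gpt-5",
    instructions="Before every tool call, briefly explain why you are calling it and what you expect to learn.",
    tools=tools,
    input="Hi, I'm Jordan Lee, born 1990-04-12. I need to reschedule my appointment.",
)

# The output interleaves short preamble messages with the tool calls themselves.
for item in response.output:
    print(item.type)
```

Those interleaved preamble messages are what give you the runtime visibility described above.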
#EVALS
Evaluating against the rest: Is GPT-5 the right model for your use-case?
Benchmarks tell you which models clear the hardest exams, but they don’t tell you how those models will handle your workload.
GPT-5 scores at the top on reasoning and coding tests, yet the real question is: how does it respond to your prompts, with your data, under your latency and cost constraints?
In a report we ran this year, we found that most developers are already running evaluations, or at least more than half of them are:
Do you perform evaluations on your AI or applications?
Yes: 57.4%
Planning to: 30.9%
No: 11.7%
To perform evaluations, take a slice of your actual use case, whether that’s answering support tickets, writing code, or analyzing documents, and run it across GPT-5 and its closest peers. Compare not just accuracy, but also reliability, speed, and price. Models can look nearly identical on a leaderboard, but side by side in production they often reveal sharp differences in quality.
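A minimal sketch of that kind of side-by-side run might look like the following; the dataset, the naive substring scoring, and the list of models to compare are all placeholders for your own use case.

```python
import time
from openai import OpenAI

client = OpenAI()

# Placeholder slice of a real workload: (prompt, expected label) pairs.
DATASET = [
    ("Ticket: 'I was double charged in July.' Which team should handle this?", "billing"),
    ("Ticket: 'The app crashes when I upload a PDF.' Which team should handle this?", "engineering"),
]

MODELS = ["gpt-5", "gpt-5-mini"]  # swap in whichever peers you want to compare

def run_eval(model: str) -> dict:
    correct, latencies = 0, []
    for prompt, expected in DATASET:
        start = time.time()
        resp = client.responses.create(model=model, input=prompt)
        latencies.append(time.time() - start)
        # Naive scoring: does the expected label appear in the answer?
        if expected.lower() in resp.output_text.lower():
            correct += 1
    return {
        "model": model,
        "accuracy": correct / len(DATASET),
        "avg_latency_s": round(sum(latencies) / len(latencies), 2),
    }

for model in MODELS:
    print(run_eval(model))
```

From there, fold in cost per run using the pricing table above, and you have accuracy, latency, and price in one comparison.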
#TOOLING
AI solutions: Should you build or buy?
A recent MIT report concluded that purchasing AI tools from specialized vendors and building partnerships succeed about 67% of the time, while internal builds succeed only one-third as often.
We’ve seen a lot of companies start with homegrown MVPs, only to find that all of their engineering effort goes into maintaining the MVP instead of solving for the customer.
The same MIT report suggests that internal development efforts have substantially lower success rates despite being more commonly attempted:
% of successful deployments
Strategic partnerships (Buy): 66%
Internal development (Build): 33%
Companies that successfully onboarded a vendor for their GenAI efforts had a higher share of successful AI deployments (66%) than those that decided to build their solutions in-house (33%).

#ROLES
Building with LLMs: Is It Just a Developer’s Job?
Engineers are often tasked with doing it all, but the reality is that AI development is a team sport. It represents a new paradigm that involves multiple parts of the organization, and unlike traditional software development, the unpredictable nature of GenAI models creates a greater need for cross-functional collaboration.
To ensure your AI performs well in production, you'll need to collaborate with non-technical teams, but which ones?
Who participates in the AI development process on your team?
Engineering: 82.3%
Leadership / execs: 60.8%
Subject Matter Experts (SMEs): 57.5%
Product: 55.4%
Design: 38.2%
Other: 3%
Our data shows that product development teams (engineering, product, and design), leadership, and subject matter experts (SMEs) are all key players in AI product development.
This is largely driven by the use of natural language for writing prompts instead of code, as well as the critical role of SMEs in ensuring the AI meets specific requirements.
Building with GenAI is undeniably a collaborative effort. If you want your AI to perform reliably and truly deliver value to customers, there’s no way around it, you must work closely with product teams and subject matter experts (SMEs).
#IMPACT
Measuring AI impact may take time
With GenAI, product development teams can tackle old problems with new solutions. What was once impossible before the AI boom is now accessible to everyone; today, anyone can develop with AI.
But what’s the biggest impact of all these AI initiatives?

What is the biggest impact from your AI product(s)?
Competitive advantage: 31.6%
Big cost and time savings: 27.1%
No measurable impact yet: 24.2%
Higher user adoption rates: 12.6%
Other: 4.5%
Nothing surprising here—competitive advantage and significant cost and time savings lead the pack at 31.6% and 27.1%, respectively. What’s interesting, though, is that nearly a quarter of respondents said there’s been no measurable impact yet. Are they just starting to innovate with AI, or are these investments difficult to measure? We’ll have to wait and see.
One thing we do know for sure: the AI development train isn’t slowing down, and companies are planning to dive even deeper in 2025. Here’s what they have in store!
#PROD
Why AI pilots fail without cross-functional buy-in
The MIT “State of AI in Business 2025” report found that 95% of GenAI pilots don’t boost revenue, not due to poor models, but because they were never integrated properly into workflows or reviewed by the right people.
Everything in this playbook (benchmarks, prompting, evals, monitoring) points to the same truth: building with GPT-5 isn’t just about the model, it’s about the process.
That's why we're building Vellum:

Vellum: Enterprise AI layer, built for product managers, legal experts, and AI engineers alike.

Prompt and Model Comparison
Testing prompts and models shouldn’t mean spinning up custom scripts every time.
In Vellum, you can line up different prompts, models, and reasoning-effort settings side-by-side, see how they perform, and pick the best fit.
Evaluations at Scale
Most teams still rely on manual reviews of outputs, which don’t scale. Vellum turns evaluations into structured, repeatable tests that anyone on your team can run. This makes it easier to track progress, catch regressions, and share results across functions.
Observability in Production
The real test of any agent happens once users are involved. With Vellum, you can monitor outputs in production, trace errors back to prompts, and spot issues before they grow. This closes the loop between experimentation and live usage.
Cross-Functional Workflows
AI development isn’t just a developer’s job. Vellum makes it possible for product, design, QA, and SMEs to run tests and contribute insights, without needing to touch an IDE. That way, the people closest to the use case can shape how the AI behaves.
#INSIGHTS
GPT-5 pushes the boundary on reasoning, math, coding, and cost efficiency—but building with it is about more than raw benchmarks. The real challenge is making it work in production: writing prompts that hold up under pressure, running evaluations at scale, and monitoring outputs once real users are involved.
2025 showed us that AI development is no longer a developer-only game. Product managers, designers, subject matter experts, and ops teams all play a role.

Looking ahead, companies are doubling down on agentic workflows and customer-facing AI, work that requires tighter collaboration and stronger tooling. Better tooling means faster adoption, more use cases, and a clear advantage for companies that move first.
That’s why we built Vellum. It gives your team everything you need to experiment, evaluate, and deploy—all in one place. From low-code prompt testing to orchestration, RAG, evals, and monitoring, Vellum helps you turn GPT-5 from a model into a working agent you can trust.