
GPT-5.2 Benchmarks

Breaking down OpenAI's GPT-5.2 model performance across coding, reasoning, and long-horizon planning.


OpenAI just dropped GPT-5.2, hot on the heels of Gemini 3 Pro stunning the AI space. It appears to be built to power teams pushing real workloads, with serious upgrades over GPT-5.1 in performance across coding, math, planning, and multimodal tasks.

It’s available now through the API, with Instant, Thinking, and Pro variants rolling out across ChatGPT paid plans.
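For reference, here's a minimal sketch of calling the model through the OpenAI Python SDK. The model identifier ("gpt-5.2") and the reasoning-effort setting shown are assumptions for illustration, so confirm the exact names your account exposes in OpenAI's model documentation.

```python
# Minimal sketch: calling GPT-5.2 via the OpenAI Python SDK (Responses API).
# NOTE: the model name "gpt-5.2" and the reasoning effort value are
# assumptions for illustration; check OpenAI's docs for the exact identifiers.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2",                 # assumed identifier
    reasoning={"effort": "medium"},  # assumed setting for a Thinking-style run
    input="Summarize the trade-offs of migrating our agents from GPT-5.1.",
)

print(response.output_text)
```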

If you’re evaluating models for production agents or complex workflow automation, GPT-5.2 is positioned as a serious contender. Here’s what the numbers actually show.

💡 Want to see how GPT-5.2 compares to Gemini 3 Pro, Claude Opus 4.5, or Grok 4.1 for your use case? Compare them in Vellum!

Key observations from reported benchmarks

While benchmarks are inherently limited and may not fully capture real-world utility, they are our only quantifiable way to measure progress. From the reported data, we can conclude a few things:

  • Reasoning: The most compelling data points are the high scores on ARC-AGI-2 (52.9%) and GPQA Diamond (92.4%). This massive leap in abstract reasoning, beating out Gemini 3 Pro (31.1%) and Claude Opus 4.5 (37.6%), indicates a core improvement in academic and abstract reasoning.
  • Coding: A new score of 55.6% on the challenging SWE-Bench Pro benchmark confirms its superior ability to handle real-world software engineering tasks across four programming languages rather than Python alone.
  • Math: A perfect score on AIME 2025 is impressive, but the strong performance on the new FrontierMath benchmark (40.3% on Tiers 1-3) is more indicative of a robust intrinsic base for mathematical logic, even without relying on coding tools.
  • Long-Horizon Planning: The results on GDPval are arguably the most indicative of practical utility. Beating or tying industry professionals on 70.9% of knowledge work tasks shows an unprecedented ability to handle long-horizon planning and coherent execution in a professional context.
  • Vision: High scores across MMMU-Pro (86.5%) and Video-MMMU (90.5%) suggest a powerful, natively multimodal architecture capable of reasoning across temporal and spatial dimensions simultaneously.

Coding capabilities

SWE-Bench Pro and SWE-Bench Verified evaluate a model's ability to resolve real-world software issues from GitHub repositories. Unlike the Python-only Verified version, SWE-Bench Pro tests across four languages and is designed to be more challenging and industrially relevant.

GPT-5.2 Thinking sets a new state of the art of 55.6% on SWE-Bench Pro, while on the more established SWE-Bench Verified it scores 80.0%.

That puts it neck and neck with Claude Opus 4.5 (80.9%) and ahead of Gemini 3 Pro (76.2%). It is also an improvement over GPT-5.1 (76.3%) in complex, multi-language bug fixing, positioning it well for professional development workflows.

Reasoning capabilities

Reasoning benchmarks evaluate a model's ability to solve complex and novel problems. GPQA Diamond assesses PhD-level scientific knowledge, while ARC-AGI-1 and ARC-AGI-2 focus on abstract visual puzzles that resist memorization. These benchmarks are crucial for building agents that need to think and follow multi-step instructions.

GPT-5.2 Thinking scores 92.4% on GPQA Diamond, up 4.3 points from GPT-5.1, giving it a slight lead over Gemini 3 Pro (91.9%) and a significant advantage over Claude Opus 4.5 (87%) on advanced scientific questions. The most notable upgrade is in abstract reasoning.

Its 52.9% score on ARC-AGI-2 is a massive jump over Claude Opus 4.5 (37.6%) and nearly double Gemini 3 Pro's performance (31.1%), indicating a fundamental improvement in non-verbal problem-solving.

Math capabilities

The AIME 2025 benchmark, based on a challenging math competition, tests quantitative reasoning. Performance on the newer FrontierMath benchmark is even more telling, as it evaluates capability on unsolved problems at the frontier of advanced mathematics.

GPT-5.2 Thinking catches up to Claude Opus 4.5 with a perfect 100% score on AIME 2025 without tools, while Gemini 3 Pro lags 5% behind both.

The key differentiator is its performance on FrontierMath, where it scores 40.3% on Tiers 1-3, a ~10% improvement over GPT-5.1. This strong base performance shows a more robust innate mathematical intuition, making it less dependent on external tools to find a solution.

Work task capabilities

Beyond single-turn tasks, a model's ability to plan and execute multi-step workflows is a critical measure of its agentic capabilities. GDPval measures this by evaluating performance on well-specified knowledge work tasks across 44 professional occupations.

GPT-5.2 surprisingly beats or ties with top industry professionals on 70.9% of comparisons. This benchmark, which requests real work products like presentations and spreadsheets, is a powerful indicator of practical, real-world assistance. It demonstrates the model can reliably navigate complex work from start to finish, maintaining coherence and quality over long horizons.

Long context capabilities

A large context window's value depends on the model's ability to accurately retrieve information. The MRCRv2 benchmark tests this 'needle-in-a-haystack' capability by asking the model to find specific facts within a large volume of text.

GPT-5.2 Thinking demonstrates near-perfect recall, scoring 98% on the 4-needle test and 70% on the 8-needle test within its full context window (256K input tokens).

On the 8-needle test at 128K input tokens, GPT-5.2 Thinking reaches an 85% mean match ratio versus Gemini 3 Pro's 77%. This shows GPT-5.2's context window is not just large but also highly reliable, allowing it to effectively use information buried in vast documents.
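If you want to sanity-check this kind of retrieval yourself, a toy version of the multi-needle test is easy to assemble: bury a few known facts in a long distractor document, ask the model to recall them, and score the matches. The sketch below is a simplified illustration of the idea, not the MRCRv2 harness, and the model name is an assumption.

```python
# Toy multi-needle retrieval check (simplified illustration, not MRCRv2).
# The model name "gpt-5.2" is an assumption for illustration.
import random
from openai import OpenAI

client = OpenAI()

needles = {
    "access code": "7319-KELP",
    "meeting room": "Aurora-4",
    "launch date": "March 9",
}

# Build a long haystack of filler sentences with the needles buried inside.
filler = [f"This is filler sentence number {i} about nothing in particular."
          for i in range(5000)]
for key, value in needles.items():
    filler.insert(random.randrange(len(filler)), f"The {key} is {value}.")
haystack = " ".join(filler)

question = "From the document, what are the access code, meeting room, and launch date?"
response = client.responses.create(
    model="gpt-5.2",  # assumed identifier
    input=f"{haystack}\n\n{question}",
)

# Count how many of the buried facts the model recalled verbatim.
answer = response.output_text
recalled = sum(value in answer for value in needles.values())
print(f"Recalled {recalled}/{len(needles)} needles")
```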

Vision capabilities

Natively multimodal models are assessed on their ability to understand and reason across different data types. MMMU-Pro, Video-MMMU, and CharXiv are key benchmarks for this integrated understanding of images, videos, and scientific figures.

On MMMU-Pro, GPT-5.2 scores 86.5% (90.1% with Python), a slight increase over its predecessor GPT-5.1 (85.4%) while still leading Gemini 3 Pro (81%).

On Video-MMMU, GPT-5.2 scores 90.5%, ranking above Gemini 3 Pro (87.6%). This demonstrates its strength is not limited to static images, showing an advanced ability to comprehend dynamic video content.

On the CharXiv with Python benchmark, GPT-5.2 comes in at a whopping 88.7%, beating out Gemini 3 Pro (81.4%) and confirming its superior ability to interpret complex data visualizations.

Tool calling capabilities

The ability to reliably use external tools is critical for building powerful agents. The Tau2-bench Telecom benchmark evaluates this by testing models on complex, real-world tool usage scenarios within the telecommunications industry.

GPT-5.2 Thinking achieves a score of 94.5% on this benchmark, a massive jump over Gemini 3 Pro's performance (85.4%), yet it still falls short of Claude Opus 4.5 (98.2%).
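As a rough sketch of what this kind of tool use looks like in practice, the example below defines a single tool and lets the model decide whether to call it. The tool itself (check_data_plan) and the model name are illustrative assumptions, and Tau2-bench scenarios are far more involved than this single turn.

```python
# Minimal tool-calling sketch (illustrative; not the Tau2-bench harness).
# The model name "gpt-5.2" and the check_data_plan tool are assumptions.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "name": "check_data_plan",
    "description": "Look up the remaining data allowance for a customer.",
    "parameters": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}]

response = client.responses.create(
    model="gpt-5.2",  # assumed identifier
    tools=tools,
    input="Customer C-1042 is asking how much data they have left this month.",
)

# If the model decided to call the tool, inspect the arguments it produced.
for item in response.output:
    if item.type == "function_call":
        print(item.name, json.loads(item.arguments))
```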

Why this matters for your agents

GPT-5.2 has firmly taken the crown from Gemini 3 Pro and finally closed the gap with Claude Opus 4.5 on the benchmarks that drive real agent performance.

Its jump in work task execution is one of the most exciting signals in this release, showing it can beat or match professionals on real knowledge work and sustain coherent output across long, multi-step workflows. Updating your agents to this model could mean large gains in the amount of work you are able to reliably automate, but only time will tell if these benchmarks hold up to real-world use.

If your agents are still running 5.1 or older baselines, you are leaving capability and reliability on the table. Update your agents in Vellum to see what performance gains you could be missing out on!



ABOUT THE AUTHOR
Nicolas Zeeb
Technical Content Lead

Nick is Vellum’s technical content lead, writing about practical ways to use both voice and text-based agents at work. He has hands-on experience automating repetitive workflows so teams can focus on higher-value work.

ABOUT THE REVIEWER
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.
