Vellum is coming to the AI Engineering World's Fair in SF. Come visit our booth and get a live demo!

We don’t speak JSON

Why forcing LLMs to output structured data is a flawed paradigm, and what might come next for developers.

6 min
Written by
Reviewed by
No items found.

Let me start with a quick story. We recently hired a new intern. Young, brilliant, fresh out of an Ivy League school, and a real go-getter. We saw a ton of potential and were excited to get them started. To ramp them up, I invited them to a customer call.

The call was gold. The customer was venting about every UX pain point and roadblock they'd hit with our product. It was a treasure trove of feedback. To make it actionable, I asked the intern, "Hey, can you prepare a spreadsheet with all the insights from that call?" I wanted a structured format so we could start chipping away at the problems.

They came back with a jumble of chaos. The columns and rows didn't make sense, and parsing the values was a nightmare. There was even a column named "Vibes," where they just wrote how they felt about each piece of feedback. It was a non-starter.

Attempt two. "Okay," I said, "Forget the spreadsheet. Just generate a PDF report summarizing everything." A few days went by with no update. When I checked in, I saw a pretty uninspired document with very little progress. They were stuck.

Finally, I threw out the old playbook. "You know what? Just record a TikTok for me. Give me your quick reactions to the call." To my surprise, they nailed it. In two minutes, with floating text and subtitles, I got a perfectly clear, digestible summary of all the key takeaways.

This (slightly fabricated) anecdote leads to a simple realization: People perform better when they produce output in the language they were trained on.

The closer the output format is to our native way of thinking, the higher the quality of that output.

Language models are no different.

We don’t speak in JSON

So why should AI agents?

Over the last couple of years working with LLMs, I've realized they’re are no different. LLMs are trained on a vast bed of natural language tokens. They perform better when they get to "speak" the language they were trained on. And I'm using "language" loosely here, I mean the fundamental format of their output.

This brings me to the main point: We, as humans, don't speak JSON. We don't communicate with each other in nested key-value pairs. JSON is a concept we constructed so our computers can perform autonomous tasks consistently.

So, I have to ask the question: why are we imposing this same limitation on our AI agents?

How we arrived here

Let's rewind two and a half years to the early days of GPT wrappers. Our first major use case was simple: classifying intent. We needed to take unstructured text, like a tweet, and produce a structured output that our downstream automations and analytics could use.

Phase 1: Asking nicely (and the Regex nightmare)

Our first attempt was just to ask the model nicely.

We'd send a prompt like, "Can you please classify the input tweet as happy, sad, or angry?" The model, being a language model, would often respond with a full sentence: "Sure, based on the content, I would classify this tweet as 'happy'."

You could try to use regular expressions to yank out the word 'happy', but that approach gets incredibly hairy and brittle as soon as your desired output is more complex than a single word.

Phase 2: Few-Shot prompting

As our demands increased, the community developed the technique of "one-shot" or "few-shot" prompting. We started feeding the model an example of the exact output format we wanted, forcing it to speak a language our systems could understand.

The prompt became: "...produce the output as a JSON with a top-level key called 'sentiment'. For example: {"sentiment": "happy"}

This worked... decently.

But it came with a host of annoying developer experience issues. How many of us have had to parse JSON out of a markdown block with triple backticks? Or dealt with a model that hallucinates a misspelled key, like sentimant instead of sentiment? We were still writing complex parsers and building logic to account for every edge case.

Phase 3: Getting desperate (and a little threatening)

It's funny to look back on, but the next phase involved some pretty creative prompt engineering. Research and testing has shown that emotional prompting, like threats, can lead to better model outputs.

Our tests showed that threatening the model—"You MUST only respond with JSON. Failure to do so will have negative consequences" was improving adherence.

It was a clever hack, but it felt like a brute-force solution, not a first-principles approach. And it raises some... interesting questions about what we're putting in the training data for future AGI.

The "holy crap" moment

Then, on June 13, 2023, OpenAI released Function Calling. For me, and many others, this was the "holy crap" moment. Everything clicked. This was the breakthrough that promised to let us reliably get structured output from LLMs.

By allowing us to define a schema that the model was heavily incentivized to follow, function calling felt like the key to building robust systems and automations. It was the moment we could finally start building what we now call agents. In many ways, this was t=0 for agent development.

But JSON isn't the answer

Function calling has served us well, but as we build more complex agents, we're starting to see the cracks in this paradigm. The underlying problems haven't gone away.

  • No Formal Guarantees: At the end of the day, you still can't have a 100% formal guarantee on format or schema adherence. Model providers have added layers to validate the output, but this can lead to failed requests, forcing you to build complex retry logic.
  • Probabilistic vs. Deterministic: We are tasking a probabilistic model with a deterministic task. LLMs are famously not built for this kind of work. We're fitting a square peg into a round hole.
  • The Token Tax: Function calling works by injecting your JSON schemas into the system prompt. A significant number of tokens are spent just defining and then generating the JSON syntax—the curly braces, the quotes, the commas, the repeated keys. These are "useless tokens" that don't contribute to the actual insight or content we want from the model.

We're doing all of this just to get the LLM to speak our computer’s language, not its own.

So, what's next?

This brings us to the critical question for the next generation of AI development: Is there a way to get structured, reliable output in a format that is more language-friendly for the model?

The answer is likely yes. We're seeing the emergence of new techniques that move beyond forcing JSON. This includes grammar-based sampling (or logit biasing) that constrains the model's output at a token level to guarantee it conforms to a specific schema, effectively building the structure as it writes.

We’ve explored more natural structured formats like YAML or even custom, human-readable DSLs (Domain-Specific Languages). For instance, instead of forcing the model to output a JSON diff, we can ask it for a unified diff block directly in Markdown:

- if status == "active":
- 	send_email(user)
+ if user.is_active():
+ 	notify(user)

This is both easy for the model to generate (it’s trained heavily on code and diffs) and trivial for developers to parse. Similarly, YAML or a lightweight DSL often reduces errors while still giving downstream systems the structure they need.

The goal is to meet the model in the middle—to find a format that is both structured enough for our programs and natural enough for the LLM. Just as our intern thrived with TikTok, perhaps our agents will thrive when we stop asking for spreadsheets and start asking for something closer to their native tongue.

ABOUT THE AUTHOR
David Vargas
Full Stack Founding Engineer

A Full-Stack Founding Engineer at Vellum, David Vargas is an MIT graduate (2017) with experience at a Series C startup and as an independent open-source engineer. He built tools for thought through his company, SamePage, and now focuses on shaping the next era of AI-driven tools for thought at Vellum.

ABOUT THE reviewer
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

No items found.
lAST UPDATED
Sep 15, 2025
share post
Expert verified
Related Posts
Guides
October 21, 2025
15 min
AI transformation playbook
LLM basics
October 20, 2025
8 min
The Top Enterprise AI Automation Platforms (Guide)
LLM basics
October 10, 2025
7 min
The Best AI Workflow Builders for Automating Business Processes
LLM basics
October 7, 2025
8 min
The Complete Guide to No‑Code AI Workflow Automation Tools
All
October 6, 2025
6 min
OpenAI's Agent Builder Explained
Product Updates
October 1, 2025
7
Vellum Product Update | September
The Best AI Tips — Direct To Your Inbox

Latest AI news, tips, and techniques

Specific tips for Your AI use cases

No spam

Oops! Something went wrong while submitting the form.

Each issue is packed with valuable resources, tools, and insights that help us stay ahead in AI development. We've discovered strategies and frameworks that boosted our efficiency by 30%, making it a must-read for anyone in the field.

Marina Trajkovska
Head of Engineering

This is just a great newsletter. The content is so helpful, even when I’m busy I read them.

Jeremy Hicks
Solutions Architect

Experiment, Evaluate, Deploy, Repeat.

AI development doesn’t end once you've defined your system. Learn how Vellum helps you manage the entire AI development lifecycle.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Build AI agents in minutes with Vellum
Build agents that take on the busywork and free up hundreds of hours. No coding needed, just start creating.

General CTA component, Use {{general-cta}}

Build AI agents in minutes with Vellum
Build agents that take on the busywork and free up hundreds of hours. No coding needed, just start creating.

General CTA component  [For enterprise], Use {{general-cta-enterprise}}

The best AI agent platform for enterprises
Production-grade rigor in one platform: prompt builder, agent sandbox, and built-in evals and monitoring so your whole org can go AI native.

[Dynamic] Ebook CTA component using the Ebook CMS filtered by name of ebook.
Use {{ebook-cta}} and add a Ebook reference in the article

Thank you!
Your submission has been received!
Oops! Something went wrong while submitting the form.
Button Text

LLM leaderboard CTA component. Use {{llm-cta}}

Check our LLM leaderboard
Compare all open-source and proprietary model across different tasks like coding, math, reasoning and others.

Case study CTA component (ROI)

40% cost reduction on AI investment
Learn how Drata’s team uses Vellum and moves fast with AI initiatives, without sacrificing accuracy and security.

Case study CTA component (cutting eng overhead) = {{coursemojo-cta}}

6+ months on engineering time saved
Learn how CourseMojo uses Vellum to enable their domain experts to collaborate on AI initiatives, reaching 10x of business growth without expanding the engineering team.

Case study CTA component (Time to value) = {{time-cta}}

100x faster time to deployment for AI agents
See how RelyHealth uses Vellum to deliver hundreds of custom healthcare agents with the speed customers expect and the reliability healthcare demands.

[Dynamic] Guide CTA component using Blog Post CMS, filtering on Guides’ names

100x faster time to deployment for AI agents
See how RelyHealth uses Vellum to deliver hundreds of custom healthcare agents with the speed customers expect and the reliability healthcare demands.
New CTA
Sorts the trigger and email categories

Dynamic template box for healthcare, Use {{healthcare}}

Start with some of these healthcare examples

Population health insights reporter
Combine healthcare sources and structure data for population health management.
Healthcare explanations of a patient-doctor match
Summarize why a patient was matched with a specific provider.

Dynamic template box for insurance, Use {{insurance}}

Start with some of these insurance examples

Agent that summarizes lengthy reports (PDF -> Summary)
Summarize all kinds of PDFs into easily digestible summaries.
Insurance claims automation agent
Collect and analyze claim information, assess risk and verify policy details.
AI agent for claims review
Review healthcare claims, detect anomalies and benchmark pricing.

Dynamic template box for eCommerce, Use {{ecommerce}}

Start with some of these eCommerce examples

E-commerce shopping agent
Check order status, manage shopping carts and process returns.

Dynamic template box for Marketing, Use {{marketing}}

Start with some of these marketing examples

LinkedIn Content Planning Agent
Create a 30-day Linkedin content plan based on your goals and target audience.
Competitor research agent
Scrape relevant case studies from competitors and extract ICP details.

Dynamic template box for Sales, Use {{sales}}

Start with some of these sales examples

Research agent for sales demos
Company research based on Linkedin and public data as a prep for sales demo.

Dynamic template box for Legal, Use {{legal}}

Start with some of these legal examples

Legal RAG chatbot
Chatbot that provides answers based on user queries and legal documents.
Legal contract review AI agent
Asses legal contracts and check for required classes, asses risk and generate report.

Dynamic template box for Supply Chain/Logistics, Use {{supply}}

Start with some of these supply chain examples

Risk assessment agent for supply chain operations
Comprehensive risk assessment for suppliers based on various data inputs.

Dynamic template box for Edtech, Use {{edtech}}

Start with some of these edtech examples

Turn LinkedIn Posts into Articles and Push to Notion
Convert your best Linkedin posts into long form content.

Dynamic template box for Compliance, Use {{compliance}}

Start with some of these compliance examples

No items found.

Dynamic template box for Customer Support, Use {{customer}}

Start with some of these customer support examples

Q&A RAG Chatbot with Cohere reranking
Trust Center RAG Chatbot
Read from a vector database, and instantly answer questions about your security policies.

Template box, 2 random templates, Use {{templates}}

Start with some of these agents

Agent that summarizes lengthy reports (PDF -> Summary)
Summarize all kinds of PDFs into easily digestible summaries.
Review Comment Generator for GitHub PRs
Generate a code review comment for a GitHub pull request.

Template box, 6 random templates, Use {{templates-plus}}

Build AI agents in minutes

Legal contract review AI agent
Asses legal contracts and check for required classes, asses risk and generate report.
Clinical trial matchmaker
Match patients to relevant clinical trials based on EHR.
AI legal research agent
Comprehensive legal research memo based on research question, jurisdiction and date range.
Legal RAG chatbot
Chatbot that provides answers based on user queries and legal documents.
Research agent for sales demos
Company research based on Linkedin and public data as a prep for sales demo.
SOAP Note Generation Agent
Extract subjective and objective info, assess and output a treatment plan.

Build AI agents in minutes for

{{industry_name}}

Clinical trial matchmaker
Match patients to relevant clinical trials based on EHR.
Prior authorization navigator
Automate the prior authorization process for medical claims.
Population health insights reporter
Combine healthcare sources and structure data for population health management.
Legal document processing agent
Process long and complex legal documents and generate legal research memorandum.
Legal contract review AI agent
Asses legal contracts and check for required classes, asses risk and generate report.
Legal RAG chatbot
Chatbot that provides answers based on user queries and legal documents.

Case study results overview (usually added at top of case study)

What we did:

1-click

This is some text inside of a div block.

28,000+

Separate vector databases managed per tenant.

100+

Real-world eval tests run before every release.