Author

Anita Kirkovska

Founding Growth Lead

78 posts

Articles by Anita Kirkovska

Everything You Need to Know About GPT-5.5

OpenAI released GPT-5.5, the first fully retrained base model since GPT-4.5. Here's the full benchmark breakdown, how it compares to Claude Opus 4.7, pricing, and what developers are saying.

Model Comparisons • Apr 25, 2026 • 12 min

AI psychosis is real, and you probably have it

Karpathy calls it AI psychosis. Garry Tan calls it cyber psychosis. Researchers call it brain fry. I call it competence addiction. Here's what's actually happening to the people building with AI.

Uncategorized • Apr 14, 2026 • 15 min

Everything You Need to Know About Claude Mythos

Anthropic published a 200+ page system card for Claude Mythos — their most capable model yet. Here's what's in it and why it matters.

Uncategorized • Apr 7, 2026 • 5 min

Understanding Logprobs: What They Are and How to Use Them

Learn what OpenAI's logprobs are and how you can use them in your LLM applications.

Guides • Dec 3, 2025 • 7 min

Document Data Extraction in 2026: LLMs vs OCRs

A choice dependent on specific needs, document types and business requirements.

Guides • Dec 3, 2025 • 10 min

GPT-5 Benchmarks

See how GPT-5 performs across benchmarks, with a big focus on health.

Guides • Dec 3, 2025 • 5 min

Google's AP2: A new protocol for AI agent payments

How verifiable mandates are creating a secure foundation for AI-driven commerce.

Guides • Dec 3, 2025 • 7 min

A Guide to LLM Observability

Think your APM tool has your AI covered? Think again. LLMs need their own observability playbook.

Guides • Oct 17, 2025 • 20 min

OpenAI's Agent Builder Explained

A breakdown of OpenAI’s new Agent Builder and what it signals for the future of building and deploying AI agents.

All • Oct 6, 2025 • 8 min

Zero-Shot vs Few-Shot prompting: A Guide with Examples

Exploring zero-shot & few-shot prompting: usage, application methods, and limits.

Guides • Sep 23, 2025 • 7 min

Chain of Thought Prompting (CoT): Everything you need to know

We break down when Chain-of-Thought adds value, when it doesn’t, and how to use it in today’s LLMs.

Guides • Sep 22, 2025 • 13 min

Build AI Products Faster: Top Development Platforms Compared

Compare top AI platforms for fast, reliable development in 2025.

LLM basics • Sep 19, 2025 • 13 min

Understanding your agent’s behavior in production

You can’t improve what you can’t see, so start tracking every decision your agent makes.

Guides • Sep 15, 2025 • 16 min

How can agentic capabilities be deployed in production today?

A practical guide to deploying agentic capabilities: what works, what doesn’t, and how to keep it reliable in prod.

Guides • Sep 7, 2025 • 9 min

Partnering with Composio to Help You Build Better AI Agents

Building AI agents is 10x easier with 10,000+ tools and built-in LLM tooling support

Guides • Aug 12, 2025 • 5 min

OpenAI o3 vs gpt-oss 120b

Just another eval confirming a 90% discount with the highest performance from gpt-oss 120b.

Model Comparisons • Aug 6, 2025 • 8 min

How to craft effective prompts

A curated list of best practices, techniques and practical advice on how to get better at prompt engineering.

Guides • Aug 5, 2025 • 17 min

Subliminal Learning in LLMs

LLMs carry hidden traits in their data and we have no idea how.

Guides • Jul 27, 2025 • 6 min

Introducing Vellum Agent Builder

Go from idea to AI workflow in seconds and continue to build in the UI or your IDE.

Product Updates • Jul 18, 2025 • 4 min

Introducing Custom Docker Images & Custom Nodes

Complete control over the business logic and runtime of your AI workflows in Vellum.

Product Updates • Jul 15, 2025 • 6 min

Big Ideas from the AI Engineer World’s Fair

What’s shaping AI products, agents, and infrastructure in 2025.

LLM basics • Jun 8, 2025 • 12 min

10 Humanloop Alternatives in 2025

A side-by-side look at Humanloop and 10 other LLM platforms.

Guides • Jun 3, 2025 • 16 min

Evaluation: Claude 4 Sonnet vs OpenAI o4-mini vs Gemini 2.5 Pro

Analyzing the difference in performance, cost and speed between the world's best reasoning models.

Model Comparisons • May 23, 2025 • 9 min

How to connect a Vellum AI Workflow with your Lovable app

Build a functional chatbot using Vellum AI Workflows and Lovable with just a few prompts.

Guides • May 13, 2025 • 6 min

How to evaluate an LLM evaluation framework

A quick guide to picking the right framework for testing your AI workflows.

Guides • Apr 24, 2025 • 7 min

Evaluating models on adaptive reasoning, SAT questions & real-world classification tasks

Evaluating whether SOTA models can really reason.

Uncategorized • Apr 14, 2025 • 2 min

Four Reasons Enterprise AI Projects Get Stuck

A wake-up call to not underestimate the unique challenges of working with LLMs.

Guides • Apr 14, 2025 • 5 min

MCP: The Hype vs. Reality

LLMs are stepping outside the sandbox. Should you let them?

Guides • Apr 9, 2025 • 5 min

How Drata built an enterprise-grade AI solution with Vellum

See how Drata leveraged Vellum to build enterprise-grade AI workflows that enhance GRC automation.

Customer Stories • Mar 18, 2025 • 6 min

Native integration with IBM’s Granite models

Support for IBM Granite models in Vellum.

Product Updates • Mar 1, 2025 • 2 min

GPT-4.5 vs Claude 3.7 Sonnet

Comparing GPT-4.5 and Claude 3.7 Sonnet on cost, speed, SAT math equations, and adaptive reasoning skills.

Model Comparisons • Feb 28, 2025 • 9 min

GPT-4.5 is here: Better, but not the best

Feels more natural, hallucinates less, can be persuaded—and it’s not a game-changer.

Guides • Feb 27, 2025 • 7 min

Claude 3.7 Sonnet vs OpenAI o1 vs DeepSeek R1

Learn how Anthropic's latest model compares to similar top-tier reasoning models on the market.

Model Comparisons • Feb 25, 2025 • 9 min

How Rely Health Deploys Healthcare AI Solutions 100x Faster

Learn how Vellum enables Rely Health to rapidly build, test, and deploy AI-powered patient care solutions.

Customer Stories • Feb 20, 2025 • 6 min

How Revamp Reliably Runs 15M+ LLM Executions in Production

Learn how to optimize prompt versioning, debug efficiently, and make real-time updates to boost AI performance.

Customer Stories • Feb 10, 2025 • 6 min

Claude 3.7 Sonnet: Can It Actually Reason?

Evaluating the 'thinking' of Claude 3.7 Sonnet and other reasoning models to understand how they really reason.

Guides • Jan 30, 2025 • 13 min

Analysis: OpenAI o1 vs DeepSeek R1

Explore how o1 and R1 perform on well-known reasoning puzzles, now tested in new contexts.

Model Comparisons • Jan 30, 2025 • 9 min

Breaking down the DeepSeek-R1 training process—no PhD required

Learn how DeepSeek achieved OpenAI o1-level reasoning with pure RL and solved issues through multi-stage training.

Guides • Jan 24, 2025 • 11 min

What to do when an LLM request fails

Rate limiting and downtime are common issues with LLMs. Here’s how to manage them in production.

Guides • Dec 16, 2024 • 7 min

Llama 3.3 70b vs GPT-4o

Learn how the latest model from Meta, Llama 3.3 70b, compares to GPT-4o on three tasks.

Model Comparisons • Dec 10, 2024 • 8 min

Native support for SambaNova inference in Vellum

Now you can run Llama 3.1 405b at 200 t/s via SambaNova on Vellum!

Product Updates • Dec 9, 2024 • 2 min

AI Development Survey: Help us build the ultimate AI changelog

Share your AI process in our 4-minute anonymous survey. Get early insights and a chance to win a MacBook M4 Pro.

LLM basics • Nov 25, 2024 • 3 min

Announcing Native Support for Cerebras Inference in Vellum

Starting today, you can unlock 2,100 t/s with Llama 3.1 70B in Vellum for real-time AI apps.

Product Updates • Oct 24, 2024 • 4 min

How Glowing Personalized Hospitality Experiences with AI

Discover how Glowing leverages Vellum's Workflows to create innovative AI solutions for the hospitality industry.

Customer Stories • Oct 1, 2024 • 5 min

OpenAI o1: Prompting Tips, Limitations, and Capabilities

Learn how to prompt OpenAI o1 models, understand their limits and the opportunities ahead.

Guides • Sep 13, 2024 • 6 min

LLM Benchmarks: Overview, Limits and Model Comparison

Understand the latest benchmarks, their limitations, and how models compare.

Guides • Sep 11, 2024 • 12 min

How Woflow Decoupled AI Updates for 50% Faster Delivery — Without the Infra Stress

Learn how Woflow sped up AI development by 50% — making it easier to handle errors, improve models and ship updates.

Customer Stories • Sep 10, 2024 • 7 min

How this EdTech Company Made AI Development 10x Faster with Vellum

Explore how a leading EdTech company saves 50 eng hours per month and empowers everyone on the team to contribute.

Customer Stories • Aug 28, 2024 • 7 min

The 6 Stages for Successful AI Implementation

Learn critical strategies to build and launch AI systems quickly and reliably.

Guides • Aug 20, 2024 • 10 min

How Vellum Helped Odyseek Build Smarter AI Faster

Learn how Odyseek used Vellum to simplify AI development and improve team collaboration.

Customer Stories • Aug 16, 2024 • 4 min

Llama 3.1 405b vs Leading Closed-Source Models

Discover how Llama 3.1 405b stacks up against GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet on three tasks.

Model Comparisons • Jul 26, 2024 • 8 min

Evaluation: Llama 3.1 70B vs. Comparable Closed-Source Models

Explore Llama 3.1 70b's upgrades and see how it stacks up against same-tier closed-source models.

Model Comparisons • Jul 24, 2024 • 8 min

Claude 3.5 Sonnet vs GPT-4o

Learn how Claude 3.5 Sonnet compares to GPT-4o on data extraction, classification and verbal reasoning tasks.

Model Comparisons • Jun 25, 2024 • 9 min

Llama 3 70B vs GPT-4: Comparison Analysis

Find out how Llama 3 70B stacks up against GPT-4 in terms of cost, speed, and performance on specific tasks.

Model Comparisons • May 8, 2024 • 12 min

Rentgrata's Test Driven Journey to a Production-Ready Chatbot

Learn how Rentgrata used Vellum to evaluate their chatbot, and cut development time in half.

Customer Stories • May 2, 2024 • 4 min

LlamaIndex vs LangChain Comparison

Discover the main differences between LangChain and LlamaIndex, and when to use each.

Guides • May 1, 2024 • 13 min

RAG vs Fine-Tuning: How to Choose the Right Technique?

Learn how RAG compares to fine-tuning and the impact of both model techniques on LLM performance.

Guides • Apr 30, 2024 • 10 min

Tutorial: Setting Up OpenAI Function Calling with Chat Models

Learn how to use OpenAI function calling in your AI apps to enable reliable, structured outputs.

Guides • Apr 23, 2024 • 6 min

How Autobound Achieved a 20x Faster End-to-End LLM Iteration Cycle

Iterating on prompts using OpenAI's playground & Azure AI studio was challenging, until Autobound discovered Vellum.

Customer Stories • Apr 11, 2024 • 4 min

Redfin's Test Driven Development Approach to Building an AI Virtual Assistant

Discover how Redfin used Vellum to develop and evaluate a production-ready AI assistant, now live in 14 markets.

Customer Stories • Apr 9, 2024 • 7 min

How to Count Tokens Before you Send an OpenAI API Request

Learn how to use Tiktoken and Vellum to programmatically count tokens before running OpenAI API requests.

Guides • Mar 27, 2024 • 6 min

Getting Started with Prompt Chaining

Learn how to improve LLM outputs, and make your setup more reliable using prompt chaining.

Guides • Mar 26, 2024 • 5 min

How to Evaluate Your RAG System?

Learn how to use retrieval and content generation metrics to consistently evaluate and improve your RAG system.

Guides • Mar 8, 2024 • 5 min

How can I get GPT-3.5 Turbo to follow instructions like GPT-4?

Learn prompt engineering tips on how to make GPT-3.5 perform as well as GPT-4.

Guides • Feb 15, 2024 • 10 min

How Lavender cut latency by half for 90K monthly requests in production

Learn how Lavender develops and manages more than 20 LLM features in production.

Customer Stories • Feb 13, 2024 • 5 min

Prompt Engineering Guide for Claude Models

Learn how to prompt Claude with these 11 prompt engineering tips.

Guides • Feb 2, 2024 • 10 min

How Codingscape improved time-to-market for their AI apps

Learn how Vellum helped Codingscape to ship AI apps quicker and win more projects.

Customer Stories • Feb 1, 2024 • 5 min

How can I use LLMs to classify user intents for my chatbot?

Learn how to build and evaluate intent handler logic in your chatbot workflow

Guides • Jan 11, 2024 • 6 min

3 Strategies to Reduce LLM Hallucinations

Methods and techniques to reduce hallucinations and maintain more reliable LLMs in production.

Guides • Jan 3, 2024 • 7 min

Four LLM hallucinations and ways to fix them

What LLM hallucination is, the four most common hallucination types, and their causes.

Guides • Jan 1, 2024 • 5 min

Classifying Customer Tickets using Gemini Pro

Comparing the performance of Gemini Pro with zero-shot and few-shot prompting when classifying customer support tickets.

Guides • Dec 20, 2023 • 4 min

Best Model for Text Classification: Gemini Pro, GPT-4 or Claude 2?

Comparing GPT-3.5 Turbo, GPT-4 Turbo, Claude, and Gemini Pro on classifying customer support tickets.

Model Comparisons • Dec 13, 2023 • 8 min

Tree of Thought Prompting: What It Is and How to Use It

Learn how to use Tree of Thought prompting to improve LLM results

Guides • Nov 30, 2023 • 5 min

User Confidence in OpenAI vs. Alternative models/providers

Discover how recent OpenAI developments have influenced user confidence and interest in OpenAI alternatives

Guides • Nov 28, 2023 • 8 min

First impressions with the Assistants API

Assistants API: Easy assistant setup with memory management - but what's under the hood?

Guides • Nov 16, 2023 • 8 min

The ABCs of Multimodal AI: Models, tasks and use-cases

How to use Multimodal AI models to build apps that solve new tasks and offer unique experiences for end users.

Guides • Nov 6, 2023 • 7 min

Automatic data labeling with LLMs

LLMs can label data at the same or better quality compared to human annotators, but ~20x faster and ~7x cheaper.

Guides • Nov 2, 2023 • 9 min

How Narya's team uses Vellum for auto data labeling & deployments

Learn how Vellum helped Narya.AI save time and make AI easy for everyone on their team.

Customer Stories • Oct 25, 2023 • 4 min