Native support for SambaNova inference in Vellum

Now you can run Llama 3.1 405B at 200 tokens/second via SambaNova on Vellum!


The Llama 3.1 405B model, with its 405 billion parameters, offers exceptional capabilities but requires substantial computational resources.

Running this model effectively requires high-performance hardware, including multiple GPUs with extensive VRAM.

SambaNova addresses these computational demands through its SN40L Reconfigurable Dataflow Unit (RDU), a processor specifically designed for AI workloads. The SN40L features a three-tier memory system comprising on-chip distributed SRAM, on-package High Bandwidth Memory (HBM), and off-package DDR DRAM. This architecture enables the chip to handle models with up to 5 trillion parameters and sequence lengths exceeding 256,000 tokens on a single system node.

Today, SambaNova serves the Llama 3.1 405B model (comparable to GPT-4o) at speeds of up to 200 tokens per second, roughly 2x faster than GPT-4o.
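To put that throughput in perspective, here is a rough sketch of what 200 tokens/second means for response time. This only models decode speed; it ignores time-to-first-token and network latency, and the 100 t/s comparison figure is simply half the quoted SambaNova speed:

```python
def generation_time_s(tokens: int, tokens_per_second: float) -> float:
    """Seconds to stream a completion of the given length (decode only)."""
    return tokens / tokens_per_second

# A 1,000-token answer at the quoted throughput vs. half that speed:
print(generation_time_s(1000, 200))  # 5.0 seconds
print(generation_time_s(1000, 100))  # 10.0 seconds
```

For long, streamed responses, doubling throughput roughly halves the time a user spends watching tokens arrive.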

With this integration, you can test the Llama 3.1 405B model, and evaluate how it compares with your current model selection.

How the native integration works

Starting today, you can enable the Llama 3.1 405B (SambaNova) model in your workspace.

To enable it, get your API key from your SambaNova profile and add it as a Secret named SAMBANOVA on the “API keys” page:

Then, enable the model from your workspace by selecting the secret you just defined:

Then, in your prompts and workflow nodes, simply select the model you just enabled:
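If you want to sanity-check your SambaNova API key outside the Vellum UI, the sketch below assembles a request for SambaNova Cloud's OpenAI-compatible chat completions API. The endpoint URL and model identifier here are assumptions; confirm the exact values in your SambaNova profile:

```python
import json

# Assumed endpoint and model name for SambaNova Cloud's
# OpenAI-compatible API; verify these against your account.
SAMBANOVA_URL = "https://api.sambanova.ai/v1/chat/completions"
MODEL = "Meta-Llama-3.1-405B-Instruct"

def build_request(prompt: str, api_key: str) -> dict:
    """Assemble the URL, headers, and JSON body for one chat completion."""
    return {
        "url": SAMBANOVA_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
        }),
    }

# Send with any HTTP client (urllib, requests, httpx):
req = build_request("Summarize this contract clause.", api_key="YOUR_KEY")
```

Once the key works here, the same secret powers the model selection inside Vellum.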

What you get with SambaNova

Comparison of Llama 3.1 405B vs. GPT-4o; check the leaderboard here.

SambaNova's integration with Vellum brings key advantages for developers working with the Llama 3.1 405B model:

Fast Performance: SambaNova Cloud runs Llama 3.1 405B at 200 tokens per second, which is 2x faster than running GPT-4o.

Lower output cost: SambaNova's pricing is $5 for input tokens and $10 for output tokens, compared to GPT-4o’s $5 for input and $15 for output.

Accurate Outputs: SambaNova keeps the original 16-bit precision of the model, so you get reliable and accurate results without cutting corners. Check how Llama 3.1 405B compares with other models in our LLM leaderboard.

Handles Complex Applications: The platform is designed to support demanding use cases like real-time workflows and multi-agent systems, making it flexible for a variety of projects.
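The pricing difference above compounds at scale. A back-of-envelope sketch, assuming the quoted prices are per million tokens:

```python
# Per-million-token prices from the comparison above (assumed units).
PRICES = {
    "sambanova_llama_405b": {"input": 5.00, "output": 10.00},
    "gpt_4o": {"input": 5.00, "output": 15.00},
}

def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for a given volume of input and output tokens."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Example: 10M input + 10M output tokens in a month
print(token_cost("sambanova_llama_405b", 10_000_000, 10_000_000))  # 150.0
print(token_cost("gpt_4o", 10_000_000, 10_000_000))                # 200.0
```

Because input prices match, the entire gap comes from output tokens, so generation-heavy workloads (summarization, drafting, agents) see the largest savings.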

If you want to test the inference speed with SambaNova, get in touch! We provide the tooling and best practices for building and evaluating AI systems that you can trust in production.

ABOUT THE AUTHOR
Anita Kirkovska
Founding Growth Lead

An AI expert with a strong ML background, specializing in GenAI and LLM education. A former Fulbright scholar, she leads Growth and Education at Vellum, helping companies build and scale AI products. She conducts LLM evaluations and writes extensively on AI best practices, empowering business leaders to drive effective AI adoption.

Last updated: Dec 9, 2024