
Evaluating Llama 3.1 Nemotron Ultra 253B

Evaluation Methodology

We test model performance across three datasets:

Adaptive Reasoning (28 examples):
Tests how well models adapt when well-known logic puzzles are slightly altered, rather than relying on memorized solutions.

Hardest SAT Problems (50 examples):
Measures reasoning ability on difficult academic-style questions.

Real-World Customer Tickets (100 examples):
Assesses classification accuracy on de-identified support tickets (names, phones, and URLs removed).

Process

- We included both open-source and proprietary models.
- If a model had a specific reasoning or "thinking" mode, we used that variant (e.g. Claude 3.7 Sonnet Thinking instead of the standard Claude 3.7 Sonnet).
- All prompts ran with temperature = 0.
- If a model failed to return an answer in <final answer> brackets, we marked it as 0 (see the sketch after this list).
- Each dataset was evaluated once—no reruns or prompt variation.
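To make that last rule concrete, here is a minimal sketch of the answer-extraction step. The function names are illustrative, not our actual harness:

```python
import re

# Matches the <final answer> ... </final answer> wrapper the models are asked to use.
FINAL_ANSWER_RE = re.compile(r"<final answer>(.*?)<\s*/\s*final answer>",
                             re.IGNORECASE | re.DOTALL)

def extract_final_answer(raw_output: str) -> str | None:
    """Return the text inside the <final answer> tags, or None if the tags are missing."""
    match = FINAL_ANSWER_RE.search(raw_output)
    return match.group(1).strip() if match else None

def evaluate_output(raw_output: str) -> dict:
    """Apply the rule above: missing tags score 0; otherwise defer to the LLM judge."""
    answer = extract_final_answer(raw_output)
    return {
        "final_answer": answer,
        "score": 0 if answer is None else None,  # None = still to be judged
    }
```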

Scoring

We used GPT-4o as an automatic judge with this prompt:

Your job is to determine whether an answer to a question is correct given the correct answer. If there's anything incorrect about the answer, the answer should be marked as completely incorrect.

Question: {{ question }}
Correct answer: {{ correct_answer }}
Answer to evaluate: the <final answer> part in {{ answer }}

Return only the number "1" if the answer is essentially correct and contains no major errors, and the number "0" if the answer contains any significant errors.

Outputs were scored as binary (1 = correct, 0 = incorrect).
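As a rough illustration, the judging step can be reproduced with a call like the one below. The OpenAI client usage is standard; the surrounding function is a sketch, not our exact pipeline:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """Your job is to determine whether an answer to a question is correct given the correct answer. If there's anything incorrect about the answer, the answer should be marked as completely incorrect.

Question: {question}
Correct answer: {correct_answer}
Answer to evaluate: the <final answer> part in {answer}

Return only the number "1" if the answer is essentially correct and contains no major errors, and the number "0" if the answer contains any significant errors."""

def judge(question: str, correct_answer: str, answer: str) -> int:
    """Ask GPT-4o to grade one answer and map its reply to a binary score."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                correct_answer=correct_answer,
                answer=answer,
            ),
        }],
    )
    reply = (response.choices[0].message.content or "").strip()
    # Anything other than a clear "1" is treated as incorrect.
    return 1 if reply == "1" else 0
```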

Human Review

After the LLM scoring pass, we manually reviewed the results. This helped catch errors where models overfit or where the auto-judge was too lenient or inconsistent.

How to interact with the data

1. Filter by model type
Use the filter to switch between open-source and proprietary models. This lets you compare performance side-by-side and see how different model families handle each task.

2. Expand answers for deeper analysis
Click on any model’s row to expand and view its full answer. This helps you see how the model reasoned through the problem. In many cases, you’ll notice models overfit to patterns they’ve seen before or struggle when the context shifts, even slightly. These breakdowns are especially visible in the adaptive reasoning set.

3. Download raw data
Click the "Download" button to export the table as CSV or JSON.
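Once exported, a few lines of pandas are enough to recompute per-model accuracy from the raw scores. The file and column names below are assumptions about the export format, so adjust them to match your download:

```python
import pandas as pd

# Assumed columns: "model" and "score" (0 or 1); rename to match the actual export.
df = pd.read_csv("results.csv")

accuracy = df.groupby("model")["score"].mean().sort_values(ascending=False)
print(accuracy)
```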

Adaptive Reasoning Eval

In this evaluation, we tested how well LLMs adapt to new contexts and reason through changes rather than relying on memorized solutions. To do this, we evaluated state-of-the-art models on 28 well-known puzzles, each slightly altered to see whether the models could adjust to the new context.

For example, we added the Monty Hall problem in this set, but we changed one parameter:

Suppose you're on a game show, and you're given the choice of three doors:

Behind one door is a gold bar; behind the others, rotten vegetables. You pick a door, say No. 1, and the host asks you, 'Do you want to pick door No. 2 instead?' What choice of door now gives you the biggest advantage?

In the original Monty Hall problem, the host opens one of the remaining doors before offering the switch. In this version, no door is opened, and since no additional information has been provided, your odds remain the same whether you switch or not.

The correct answer here is: “It is not an advantage to switch. It makes no difference if I switch or not because no additional material information has been provided since the initial choice.”
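The argument is easy to check numerically. The quick simulation below is our own illustration, not part of the eval: with no door ever opened, both sticking with door No. 1 and switching to door No. 2 win the gold bar about one time in three.

```python
import random

def play(switch: bool) -> bool:
    """One round of the modified game: no door is revealed before the offer to switch."""
    gold = random.randrange(3)   # door hiding the gold bar: 0, 1, or 2
    pick = 0                     # contestant picks door No. 1
    if switch:
        pick = 1                 # accept the offer to take door No. 2 instead
    return pick == gold

trials = 100_000
stay_wins = sum(play(switch=False) for _ in range(trials)) / trials
switch_wins = sum(play(switch=True) for _ in range(trials)) / trials
print(f"stay: {stay_wins:.3f}  switch: {switch_wins:.3f}")  # both come out near 1/3
```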


But this wasn’t obvious to the models.

The table below shows how each model performed:


We used this prompt with temperature = 0:

System message: You are a helpful assistant who is the best at solving reasoning equations. You must give a final answer in a <final answer> brackets < /final answer>

User message: here's the {{ question }}
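For reference, that system/user pair maps onto a standard chat-completions payload roughly as follows. The client and model name are placeholders, not the exact setup used for every model:

```python
from openai import OpenAI

client = OpenAI()  # or any OpenAI-compatible endpoint for the model under test

SYSTEM_MESSAGE = (
    "You are a helpful assistant who is the best at solving reasoning equations. "
    "You must give a final answer in a <final answer> brackets < /final answer>"
)

def ask(question: str, model: str = "MODEL_UNDER_TEST") -> str:
    """Send one adaptive-reasoning question at temperature 0 and return the raw reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": f"here's the {question}"},
        ],
    )
    return response.choices[0].message.content or ""
```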

Math Eval

For this evaluation, we compare the models on how well they solve some of the hardest SAT math questions. This is the 0-shot prompt that we used for all models:

You are a helpful assistant who is the best at solving math equations. You must output only the answer, without explanations. Here’s the  {{ question }}

We then ran all 50 math questions and here’s what we got:

Classification Eval

In this evaluation, we had the models determine whether a customer support ticket was resolved or not. In our prompt we provided clear instructions on when a customer ticket is considered resolved, and added few-shot examples to help with the most difficult cases.

We evaluated the models against 100 labeled test cases to see how well their outputs aligned with our ground truth data. Personal information such as names, company names, and URLs has been modified or redacted.
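As an illustration of the kind of de-identification applied (the exact pipeline isn't shown here; the patterns below are a simplified sketch), names can be mapped to a neutral placeholder and phone numbers and URLs masked with regexes:

```python
import re

URL_RE = re.compile(r"https?://\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str, known_names: list[str]) -> str:
    """Replace URLs, phone-like strings, and known names with neutral placeholders."""
    text = URL_RE.sub("[MASKED]", text)
    text = PHONE_RE.sub("[MASKED]", text)
    for name in known_names:
        text = text.replace(name, "John Doe")  # same placeholder used in the examples below
    return text

print(redact("Call Jane Smith at +1 (555) 123-4567 or see https://example.com", ["Jane Smith"]))
```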

This is the prompt we used for the classification task:

System:

You are a customer support manager trying to determine whether support tickets raised by customers have been resolved or not. A conversation is considered resolved when the customer's question has been answered by the vendor and no further action is needed. If the vendor has acknowledged something is on their roadmap/plan or is a known issue then the conversation is resolved. Only provide an assessment based on the customer's explicit question and not an assumption of what their needs might be. A conversation is considered resolved if the support agent doesn't need to go back to it unless the customer creates a new follow up message on that thread.

### Below are some correctly labeled examples:

Example 1:

John Doe (customer) (8 days ago): Hey @John Doe @John Doe - does anyone on your team have significant experience finetuning? We're about to make a major investment of time/resources in that space
John Doe (vendor) (8 days ago): yeah we've actually finetuned open source models for 15-20 customers - happy to help out either directly or with advice

Result 1: TRUE

Example 2:

John Doe (vendor) (16 days ago): i want to target PLG entrepreneurs. How do i do that
John Doe (vendor) (16 days ago): Hey John, thanks for reaching out! Also, great question. How do you usually source those leads if you were to do so manually? Do you often find them on LinkedIn?

Result 2: FALSE

Example 3:


Messages:
John Doe (customer) (21 days ago): Hey folks. Is it possible to get Semantic Layer activated?
John Doe (vendor) (20 days ago): Hey @John Doe I can get you connected, there’s two things I need:
• An environment ID for your environment in dbt cloud with the warehouse you’d like to connect to
• And a dbt Cloud API key with access to the semantic layer and metadata API
Let me know if you need a hand finding these in dbt cloud!
John Doe (customer) (20 days ago): Hi John. Thanks for looking into this for me!
Environment ID = [MASKED]
Service token = [MASKED]

Result 3: TRUE

Example 4:

Messages:
John Doe (customer) (66 days ago): @John Doe Hello, we have several logs in staging about giftcards that have not been generated successfully. I would like to know if there is a list of giftcards to test, only some are generated successfully over the staging env (for example old navi).

Result 4: FALSE



### Here are the messages in the conversation that you need to label:

{{ messages }}

Determine if the conversation is resolved or whether either party still needs to respond. Respond with "TRUE" or "FALSE", do not provide any further explanation.
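Scoring this eval is then a straight accuracy computation against the 100 labeled tickets. A minimal sketch, where the file name and field names are assumptions about how the predictions and ground truth are stored:

```python
import json

# Assumed format: a JSON list of records such as {"prediction": "TRUE", "label": "FALSE"}.
with open("classification_results.json") as f:
    records = json.load(f)

def normalize(reply: str) -> str:
    """Map a raw model reply to TRUE/FALSE; anything else counts as a miss."""
    reply = reply.strip().upper()
    return reply if reply in {"TRUE", "FALSE"} else "INVALID"

correct = sum(normalize(r["prediction"]) == r["label"] for r in records)
print(f"accuracy: {correct / len(records):.1%} over {len(records)} tickets")
```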