Evaluating the performance of large language models (LLMs) goes beyond statistical metrics like perplexity or bilingual evaluation understudy (BLEU) scores. For most real-world generative AI scenarios, it’s crucial to understand whether a model is producing better outputs than a baseline or an earlier iteration. This is especially important for applications such as summarization, content generation, or intelligent agents where subjective judgments and nuanced correctness play a central role.
As organizations deepen their deployment of these models in production, we’re experiencing an increasing demand from customers who want to systematically assess model quality beyond traditional evaluation methods. Current approaches like accuracy measurements and rule-based evaluations, although helpful, can’t fully address these nuanced assessment needs, particularly when tasks require subjective judgments, contextual understanding, or alignment with specific business requirements. To bridge this gap, LLM-as-a-judge has emerged as a promising approach, using the reasoning capabilities of LLMs to evaluate other models more flexibly and at scale.
Today, we’re excited to introduce a comprehensive approach to model evaluation through the Amazon Nova LLM-as-a-Judge capability on Amazon SageMaker AI, a fully managed Amazon Web Services (AWS) service to build, train, and deploy machine learning (ML) models at scale. Amazon Nova LLM-as-a-Judge is designed to deliver robust, unbiased assessments of generative AI outputs across model families. Nova LLM-as-a-Judge is available as optimized workflows on SageMaker AI, and with it, you can start evaluating model performance against your specific use cases in minutes. Unlike many evaluators that exhibit architectural bias, Nova LLM-as-a-Judge has been rigorously validated to remain impartial and has achieved leading performance on key judge benchmarks while closely reflecting human preferences. With its exceptional accuracy and minimal bias, it sets a new standard for credible, production-grade LLM evaluation.
The Nova LLM-as-a-Judge capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence.
Nova LLM-as-a-Judge was built through a multistep training process comprising supervised training and reinforcement learning stages that used public datasets annotated with human preferences. For the proprietary component, multiple annotators independently evaluated thousands of examples by comparing pairs of different LLM responses to the same prompt. To verify consistency and fairness, all annotations underwent rigorous quality checks, with final judgments calibrated to reflect broad human consensus rather than an individual viewpoint.
The training data was designed to be both diverse and representative. Prompts spanned a wide range of categories, including real-world knowledge, creativity, coding, mathematics, specialized domains, and toxicity, so the model could evaluate outputs across many real-world scenarios. The training data covered more than 90 languages and is primarily composed of English, Russian, Chinese, German, Japanese, and Italian.

Importantly, an internal bias study evaluating over 10,000 human-preference judgments against 75 third-party models confirmed that Amazon Nova LLM-as-a-Judge shows only a 3% aggregate bias relative to human annotations. Although this is a significant achievement in reducing systematic bias, we still recommend occasional spot checks to validate critical comparisons.
In the following figure, you can see how the Nova LLM-as-a-Judge bias compares to human preferences when evaluating Amazon Nova outputs compared to outputs from other models. Here, bias is measured as the difference between the judge’s preference and human preference across thousands of examples. A positive value indicates the judge slightly favors Amazon Nova models, and a negative value indicates the opposite. To quantify the reliability of these estimates, 95% confidence intervals were computed using the standard error for the difference of proportions, assuming independent binomial distributions.
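As a concrete illustration of that calculation, the following Python sketch computes a bias estimate and its 95% confidence interval from aggregate preference counts; the counts shown are illustrative, not values from the study.

```python
import math

def preference_gap_ci(judge_wins, judge_n, human_wins, human_n, z=1.96):
    """95% CI for the difference between judge and human preference rates,
    using the standard error for a difference of independent proportions."""
    p_judge = judge_wins / judge_n
    p_human = human_wins / human_n
    diff = p_judge - p_human
    se = math.sqrt(p_judge * (1 - p_judge) / judge_n +
                   p_human * (1 - p_human) / human_n)
    return diff, (diff - z * se, diff + z * se)

# Illustrative counts only
gap, (low, high) = preference_gap_ci(judge_wins=520, judge_n=1000,
                                     human_wins=500, human_n=1000)
print(f"bias estimate: {gap:+.3f}, 95% CI: [{low:+.3f}, {high:+.3f}]")
```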
Amazon Nova LLM-as-a-Judge achieves advanced performance among evaluation models, demonstrating strong alignment with human judgments across a range of tasks. For example, it scores 45% accuracy on JudgeBench (compared to 42% for Meta J1 8B) and 68% on PPE (versus 60% for Meta J1 8B). The data from Meta’s J1 8B was pulled from Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning.
These results highlight the strength of Amazon Nova LLM-as-a-Judge in chatbot-related evaluations, as shown in the PPE benchmark. Our benchmarking follows current best practices, reporting reconciled results for positionally swapped responses on JudgeBench, CodeUltraFeedback, Eval Bias, and LLMBar, while using single-pass results for PPE.
| Model | Eval Bias | JudgeBench | LLMBar | PPE | CodeUltraFeedback |
|---|---|---|---|---|---|
| Nova LLM-as-a-Judge | 0.76 | 0.45 | 0.67 | 0.68 | 0.64 |
| Meta J1 8B | – | 0.42 | – | 0.60 | – |
| Nova Micro (8B) | 0.56 | 0.37 | 0.55 | 0.60 | – |
In this post, we present a streamlined approach to implementing Amazon Nova LLM-as-a-Judge evaluations using SageMaker AI, interpreting the resulting metrics, and applying this process to improve your generative AI applications.
The evaluation process starts by preparing a dataset in which each example includes a prompt and two alternative model outputs. The JSONL format looks like this:
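A record might look like the following line; the field names (`prompt`, `response_A`, `response_B`) are illustrative of the prompt-plus-two-responses structure and should be checked against the schema expected by the Nova LLM-as-a-Judge recipe you use.

```json
{"prompt": "What is the capital of France?", "response_A": "The capital of France is Paris.", "response_B": "France's capital city is Paris, located on the Seine."}
```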
After preparing this dataset, you use the given SageMaker evaluation recipe, which configures the evaluation strategy, specifies which model to use as the judge, and defines the inference settings such as `temperature` and `top_p`.
The evaluation runs inside a SageMaker training job using pre-built Amazon Nova containers. SageMaker AI provisions compute resources, orchestrates the evaluation, and writes the output metrics and visualizations to Amazon Simple Storage Service (Amazon S3).
When it’s complete, you can download and analyze the results, which include preference distributions, win rates, and confidence intervals.
The Amazon Nova LLM-as-a-Judge uses an evaluation method called binary overall preference judge. The binary overall preference judge is a method where a language model compares two outputs side by side and picks the better one or declares a tie. For each example, it produces a clear preference. When you aggregate these judgments over many samples, you get metrics like win rate and confidence intervals. This approach uses the model’s own reasoning to assess qualities like relevance and clarity in a straightforward, consistent way.
When using the Amazon Nova LLM-as-a-Judge framework to compare outputs from two language models, SageMaker AI produces a comprehensive set of quantitative metrics. You can use these metrics to assess which model performs better and how reliable the evaluation is. The results fall into three main categories: core preference metrics, statistical confidence metrics, and standard error metrics.
The core preference metrics report how often each model’s outputs were preferred by the judge model. The `a_scores` metric counts the number of examples where Model A was favored, and `b_scores` counts cases where Model B was chosen as better. The `ties` metric captures instances in which the judge model rated both responses equally or couldn’t identify a clear preference. The `inference_error` metric counts cases where the judge couldn’t generate a valid judgment due to malformed data or internal errors.
The statistical confidence metrics quantify how likely it is that the observed preferences reflect true differences in model quality rather than random variation. The `winrate` reports the proportion of all valid comparisons in which Model B was preferred. The `lower_rate` and `upper_rate` define the lower and upper bounds of the 95% confidence interval for this win rate. For example, a `winrate` of 0.75 with a confidence interval between 0.60 and 0.85 suggests that, even accounting for uncertainty, Model B is consistently favored over Model A. The `score` field often matches the count of Model B wins but can also be customized for more complex evaluation strategies.
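To illustrate how a win rate and its interval relate to the raw counts, here is a minimal sketch using a normal approximation; the managed evaluation may use a different interval method, and the counts below are illustrative.

```python
import math

def winrate_with_ci(a_scores, b_scores, ties=0, z=1.96):
    """Win rate for Model B over all valid comparisons, with an
    approximate 95% confidence interval (normal approximation)."""
    n = a_scores + b_scores + ties
    p = b_scores / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

# Illustrative counts only
winrate, (lower_rate, upper_rate) = winrate_with_ci(a_scores=40, b_scores=55, ties=5)
print(f"winrate={winrate:.2f}, 95% CI=({lower_rate:.2f}, {upper_rate:.2f})")
```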
The standard error metrics provide an estimate of the statistical uncertainty in each count. These include `a_scores_stderr`, `b_scores_stderr`, `ties_stderr`, `inference_error_stderr`, and `score_stderr`. Smaller standard error values indicate more reliable results. Larger values can point to a need for additional evaluation data or more consistent prompt engineering.
Interpreting these metrics requires attention to both the observed preferences and the confidence intervals:

- If the `winrate` is substantially above 0.5 and the confidence interval doesn’t include 0.5, Model B is statistically favored over Model A.
- If the `winrate` is below 0.5 and the confidence interval is fully below 0.5, Model A is preferred.
- A high `inference_error` count or large standard errors suggest there might have been issues in the evaluation process, such as inconsistencies in prompt formatting or insufficient sample size.

The following is an example metrics output from an evaluation run:
In this example, Model A was preferred 16 times, Model B was preferred 10 times, and there were no ties or inference errors. The `winrate` of 0.38 indicates that Model B was preferred in 38% of cases, with a 95% confidence interval ranging from 23% to 56%. Because the interval includes 0.5, this outcome suggests the evaluation was inconclusive, and additional data might be needed to clarify which model performs better overall.
These metrics, automatically generated as part of the evaluation process, provide a rigorous statistical foundation for comparing models and making data-driven decisions about which one to deploy.
This solution demonstrates how to evaluate generative AI models on Amazon SageMaker AI using the Nova LLM-as-a-Judge capability. The provided Python code guides you through the entire workflow.
First, it prepares a dataset by sampling questions from SQuAD and generating candidate responses from Qwen2.5 1.5B and Anthropic’s Claude 3.7 Sonnet. These outputs are saved in a JSONL file containing the prompt and both responses.
We accessed Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock using the `bedrock-runtime` client. We accessed Qwen2.5 1.5B using a SageMaker hosted Hugging Face endpoint.
Next, a PyTorch Estimator launches an evaluation job using an Amazon Nova LLM-as-a-Judge recipe. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including win rates, confidence intervals, and preference counts. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing which model was preferred, how strong the preference was, and how reliable the estimates are. Through this end-to-end approach, you can assess improvements, track regressions, and make data-driven decisions about deploying generative models—all without manual annotation.
You need to complete the following prerequisites before you can run the notebook:

- An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess`, `AmazonS3FullAccess`, and `AmazonBedrockFullAccess` policies attached, to give the required access to SageMaker AI and Amazon Bedrock to run the examples.

Next, run the notebook `Amazon-Nova-LLM-as-a-Judge-Sagemaker-AI.ipynb` to start using the Amazon Nova LLM-as-a-Judge implementation on Amazon SageMaker AI.
To conduct an Amazon Nova LLM-as-a-Judge evaluation, you need to generate outputs from the candidate models you want to compare. In this project, we used two different approaches: deploying a Qwen2.5 1.5B model on Amazon SageMaker and invoking Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. First, we deployed Qwen2.5 1.5B, an open-weight multilingual language model, on a dedicated SageMaker endpoint. This was achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B model, we provided a convenient script for you to invoke:

```
python3 deploy_sm_model.py
```
When it’s deployed, inference can be performed using a helper function wrapping the SageMaker predictor API:
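The following is a minimal sketch of such a helper, assuming a Hugging Face TGI-style endpoint that accepts an `inputs`/`parameters` payload; the endpoint name and generation parameters are placeholders.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def generate_with_qwen25(prompt, endpoint_name="qwen25-1-5b-endpoint",
                         max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Invoke the Qwen2.5 SageMaker endpoint and return the generated text."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # TGI-style endpoints return a list of {"generated_text": ...} objects
    return result[0]["generated_text"]
```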
In parallel, we integrated Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. Amazon Bedrock provides a managed API layer for accessing proprietary foundation models (FMs) without managing infrastructure. The Claude generation function used the bedrock-runtime AWS SDK for Python (Boto3) client, which accepted a user prompt and returned the model’s text completion:
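A sketch of that function is shown below using the Bedrock Converse API; the model ID is an assumption and should be replaced with the Claude 3.7 Sonnet identifier or inference profile enabled in your account and Region.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Assumed model ID for Claude 3.7 Sonnet; verify the ID or inference
# profile available in your account and Region
CLAUDE_MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude(prompt, max_tokens=256, temperature=0.7):
    """Send a prompt to Claude 3.7 Sonnet on Amazon Bedrock and return the completion text."""
    response = bedrock.converse(
        modelId=CLAUDE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": temperature},
    )
    return response["output"]["message"]["content"][0]["text"]
```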
When you have both functions generated and tested, you can move on to creating the evaluation data for the Nova LLM-as-a-Judge.
To create a realistic evaluation dataset for comparing the Qwen and Claude models, we used the Stanford Question Answering Dataset (SQuAD), a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face `datasets` library to download and load the first 20 examples from the SQuAD training split:
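A sketch of that call, assuming the dataset is available under the `squad` identifier on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load only the first 20 examples of the SQuAD training split
squad = load_dataset("squad", split="train[:20]")
```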
This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:
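For example, a quick check along these lines prints the first question and its reference answer:

```python
print("Question:", squad[0]["question"])
# SQuAD stores answers as {"text": [...], "answer_start": [...]}
print("Answer:  ", squad[0]["answers"]["text"][0])
```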
For the evaluation set, we selected the first six questions from this subset:
```python
questions = [squad[i]["question"] for i in range(6)]
```
After preparing a set of evaluation questions from SQuAD, we generated outputs from both models and assembled them into a structured dataset to be used by the Amazon Nova LLM-as-a-Judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the two generation functions defined earlier:
- `generate_with_qwen25()` for completions from the Qwen2.5 model deployed on SageMaker
- `generate_with_claude()` for completions from Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock

For each prompt, the workflow attempted to generate a response from each model. If a generation call failed due to an API error, timeout, or other issue, the system captured the exception and stored a clear error message indicating the failure. This made sure that the evaluation process could proceed gracefully even in the presence of transient errors:
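The loop below sketches this step, assuming the two helper functions shown earlier and the illustrative `prompt`/`response_A`/`response_B` field names; adjust the schema to match the recipe's expected format.

```python
import json

records = []
for question in questions:
    record = {"prompt": question}
    try:
        record["response_A"] = generate_with_qwen25(question)
    except Exception as e:
        record["response_A"] = f"ERROR: Qwen2.5 generation failed: {e}"
    try:
        record["response_B"] = generate_with_claude(question)
    except Exception as e:
        record["response_B"] = f"ERROR: Claude generation failed: {e}"
    records.append(record)

# Write one JSON object per line (JSON Lines format)
with open("llm_judge.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```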
This workflow produced a JSON Lines file named `llm_judge.jsonl`. Each line contains a single evaluation record with the prompt and the two candidate responses, following the structure shown earlier.

Then, upload this `llm_judge.jsonl` to an S3 bucket that you’ve predefined:
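One way to do this is with the Boto3 S3 client; the bucket name and prefix below are placeholders for your own locations, and `evalInput` is the S3 URI passed to the training job later.

```python
import boto3

s3 = boto3.client("s3")
bucket = "your-evaluation-bucket"        # placeholder bucket name
key = "nova-llm-judge/llm_judge.jsonl"   # placeholder prefix

s3.upload_file("llm_judge.jsonl", bucket, key)

# S3 location used as the training input channel for the evaluation job
evalInput = f"s3://{bucket}/nova-llm-judge/"
```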
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova LLM-as-a-Judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the model, processes the dataset, and generates evaluation metrics in your designated Amazon S3 location.
We use the `PyTorch` estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, the container image, the evaluation recipe, and the output paths for storing results:
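The following configuration is a sketch under the assumption that the evaluation recipe is passed to the estimator through a `training_recipe` argument, as in other SageMaker recipe workflows; verify the exact arguments, recipe name, and container handling against the current SageMaker Python SDK and Nova recipe documentation.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder output location; reuses the bucket defined earlier
output_s3_uri = f"s3://{bucket}/nova-llm-judge/output/"

estimator = PyTorch(
    base_job_name="nova-llm-judge-eval",
    role=role,
    sagemaker_session=sess,
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    output_path=output_s3_uri,
    # Assumed argument: reference to your Nova LLM-as-a-Judge evaluation
    # recipe; replace with the recipe name or path you are using
    training_recipe="my-nova-llm-judge-recipe.yaml",
)
```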
When the estimator is configured, you initiate the evaluation job using the `fit()` method. This call submits the job to the SageMaker control plane, provisions the compute cluster, and begins processing the evaluation dataset:

```python
estimator.fit(inputs={"train": evalInput})
```
The following graphic illustrates the results of the Amazon Nova LLM-as-a-Judge evaluation job.
To help practitioners quickly interpret the outcome of a Nova LLM-as-a-Judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics. This function, `plot_nova_judge_results`, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary—produced when the evaluation job is complete—and generates the following visual components:
Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation.
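The following is a simplified sketch of this kind of visualization, reduced to two panels (preference counts, and the Model B win rate with its confidence interval) rather than the full six-panel figure produced by `plot_nova_judge_results`.

```python
import matplotlib.pyplot as plt

def plot_judge_summary(metrics):
    """Minimal two-panel summary: preference counts and win rate with 95% CI."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Panel 1: raw preference counts
    labels = ["Model A", "Model B", "Ties"]
    counts = [metrics["a_scores"], metrics["b_scores"], metrics["ties"]]
    ax1.bar(labels, counts)
    ax1.set_title("Preference counts")
    ax1.set_ylabel("Number of comparisons")

    # Panel 2: win rate with its confidence interval
    winrate = metrics["winrate"]
    err_low = winrate - metrics["lower_rate"]
    err_high = metrics["upper_rate"] - winrate
    ax2.errorbar([0], [winrate], yerr=[[err_low], [err_high]], fmt="o", capsize=6)
    ax2.axhline(0.5, linestyle="--", color="gray", label="No preference (0.5)")
    ax2.set_xlim(-1, 1)
    ax2.set_ylim(0, 1)
    ax2.set_xticks([])
    ax2.set_title("Model B win rate (95% CI)")
    ax2.legend()

    fig.tight_layout()
    return fig
```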
Complete the following steps to clean up your resources:
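As a minimal sketch, the following removes the Qwen2.5 endpoint and its endpoint configuration to stop ongoing charges; the resource names are placeholders, and any S3 artifacts or models you no longer need can be deleted separately.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names; use the names created during deployment
endpoint_name = "qwen25-1-5b-endpoint"

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
```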
The Amazon Nova LLM-as-a-Judge workflow offers a reliable, repeatable way to compare two language models on your own data. You can integrate this into model selection pipelines to decide which version performs best, or you can schedule it as part of continuous evaluation to catch regressions over time.
For teams building agentic or domain-specific systems, this approach provides richer insight than automated metrics alone. Because the entire process runs on SageMaker training jobs, it scales quickly and produces clear visual reports that can be shared with stakeholders.
This post demonstrates how Nova LLM-as-a-Judge—a specialized evaluation model available through Amazon SageMaker AI—can be used to systematically measure the relative performance of generative AI systems. The walkthrough shows how to prepare evaluation datasets, launch SageMaker AI training jobs with Nova LLM-as-a-Judge recipes, and interpret the resulting metrics, including win rates and preference distributions. The fully managed SageMaker AI solution simplifies this process, so you can run scalable, repeatable model evaluations that align with human preferences.
We recommend starting your LLM evaluation journey by exploring the official Amazon Nova documentation and examples. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.