Evaluating the performance of large language models (LLMs) goes beyond statistical metrics like perplexity or bilingual evaluation understudy (BLEU) scores. For most real-world generative AI scenarios, it’s crucial to understand whether a model is producing better outputs than a baseline or an earlier iteration. This is especially important for applications such as summarization, content generation, or intelligent agents where subjective judgments and nuanced correctness play a central role.
As organizations deepen their deployment of these models in production, we’re experiencing an increasing demand from customers who want to systematically assess model quality beyond traditional evaluation methods. Current approaches like accuracy measurements and rule-based evaluations, although helpful, can’t fully address these nuanced assessment needs, particularly when tasks require subjective judgments, contextual understanding, or alignment with specific business requirements. To bridge this gap, LLM-as-a-judge has emerged as a promising approach, using the reasoning capabilities of LLMs to evaluate other models more flexibly and at scale.
Today, we’re excited to introduce a comprehensive approach to model evaluation through the Amazon Nova LLM-as-a-Judge capability on Amazon SageMaker AI, a fully managed Amazon Web Services (AWS) service to build, train, and deploy machine learning (ML) models at scale. Amazon Nova LLM-as-a-Judge is designed to deliver robust, unbiased assessments of generative AI outputs across model families. Nova LLM-as-a-Judge is available as optimized workflows on SageMaker AI, and with it, you can start evaluating model performance against your specific use cases in minutes. Unlike many evaluators that exhibit architectural bias, Nova LLM-as-a-Judge has been rigorously validated to remain impartial and has achieved leading performance on key judge benchmarks while closely reflecting human preferences. With its exceptional accuracy and minimal bias, it sets a new standard for credible, production-grade LLM evaluation.
The Nova LLM-as-a-Judge capability provides pairwise comparisons between model iterations, so you can make data-driven decisions about model improvements with confidence.
Nova LLM-as-a-Judge was built through a multistep training process comprising supervised training and reinforcement learning stages that used public datasets annotated with human preferences. For the proprietary component, multiple annotators independently evaluated thousands of examples by comparing pairs of different LLM responses to the same prompt. To verify consistency and fairness, all annotations underwent rigorous quality checks, with final judgments calibrated to reflect broad human consensus rather than an individual viewpoint.
The training data was designed to be both diverse and representative. Prompts spanned a wide range of categories, including real-world knowledge, creativity, coding, mathematics, specialized domains, and toxicity, so the model could evaluate outputs across many real-world scenarios. The training data covered more than 90 languages and is primarily composed of English, Russian, Chinese, German, Japanese, and Italian.

Importantly, an internal bias study evaluating over 10,000 human-preference judgments against 75 third-party models confirmed that Amazon Nova LLM-as-a-Judge shows only a 3% aggregate bias relative to human annotations. Although this is a significant achievement in reducing systematic bias, we still recommend occasional spot checks to validate critical comparisons.
In the following figure, you can see how the Nova LLM-as-a-Judge bias compares to human preferences when evaluating Amazon Nova outputs compared to outputs from other models. Here, bias is measured as the difference between the judge’s preference and human preference across thousands of examples. A positive value indicates the judge slightly favors Amazon Nova models, and a negative value indicates the opposite. To quantify the reliability of these estimates, 95% confidence intervals were computed using the standard error for the difference of proportions, assuming independent binomial distributions.
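As a concrete illustration of that calculation, the following Python sketch computes a bias estimate and its 95% confidence interval from aggregate preference counts; the counts shown are illustrative, not values from the study.

```python
import math

def preference_gap_ci(judge_wins, judge_n, human_wins, human_n, z=1.96):
    """95% CI for the difference between judge and human preference rates,
    using the standard error for a difference of independent proportions."""
    p_judge = judge_wins / judge_n
    p_human = human_wins / human_n
    diff = p_judge - p_human
    se = math.sqrt(p_judge * (1 - p_judge) / judge_n +
                   p_human * (1 - p_human) / human_n)
    return diff, (diff - z * se, diff + z * se)

# Illustrative counts only
gap, (low, high) = preference_gap_ci(judge_wins=520, judge_n=1000,
                                     human_wins=500, human_n=1000)
print(f"bias estimate: {gap:+.3f}, 95% CI: [{low:+.3f}, {high:+.3f}]")
```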
Amazon Nova LLM-as-a-Judge achieves advanced performance among evaluation models, demonstrating strong alignment with human judgments across a range of tasks. For example, it scores 45% accuracy on JudgeBench (compared to 42% for Meta J1 8B) and 68% on PPE (versus 60% for Meta J1 8B). The data from Meta’s J1 8B was pulled from Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning.
These results highlight the strength of Amazon Nova LLM-as-a-Judge in chatbot-related evaluations, as shown in the PPE benchmark. Our benchmarking follows current best practices, reporting reconciled results for positionally swapped responses on JudgeBench, CodeUltraFeedback, Eval Bias, and LLMBar, while using single-pass results for PPE.
| Model | Eval Bias | JudgeBench | LLMBar | PPE | CodeUltraFeedback |
|---|---|---|---|---|---|
| Nova LLM-as-a-Judge | 0.76 | 0.45 | 0.67 | 0.68 | 0.64 |
| Meta J1 8B | – | 0.42 | – | 0.60 | – |
| Nova Micro (8B) | 0.56 | 0.37 | 0.55 | 0.60 | – |
In this post, we present a streamlined approach to implementing Amazon Nova LLM-as-a-Judge evaluations using SageMaker AI, interpreting the resulting metrics, and applying this process to improve your generative AI applications.
The evaluation process starts by preparing a dataset in which each example includes a prompt and two alternative model outputs. The JSONL format looks like this:
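A record might look like the following line; the field names (`prompt`, `response_A`, `response_B`) are illustrative of the prompt-plus-two-responses structure and should be checked against the schema expected by the Nova LLM-as-a-Judge recipe you use.

```json
{"prompt": "What is the capital of France?", "response_A": "The capital of France is Paris.", "response_B": "France's capital city is Paris, located on the Seine."}
```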
After preparing this dataset, you use the given SageMaker evaluation recipe, which configures the evaluation strategy, specifies which model to use as the judge, and defines the inference settings such as `temperature` and `top_p`.
The evaluation runs inside a SageMaker training job using pre-built Amazon Nova containers. SageMaker AI provisions compute resources, orchestrates the evaluation, and writes the output metrics and visualizations to Amazon Simple Storage Service (Amazon S3).
When it’s complete, you can download and analyze the results, which include preference distributions, win rates, and confidence intervals.
The Amazon Nova LLM-as-a-Judge uses an evaluation method called binary overall preference judge. The binary overall preference judge is a method where a language model compares two outputs side by side and picks the better one or declares a tie. For each example, it produces a clear preference. When you aggregate these judgments over many samples, you get metrics like win rate and confidence intervals. This approach uses the model’s own reasoning to assess qualities like relevance and clarity in a straightforward, consistent way.
When using the Amazon Nova LLM-as-a-Judge framework to compare outputs from two language models, SageMaker AI produces a comprehensive set of quantitative metrics. You can use these metrics to assess which model performs better and how reliable the evaluation is. The results fall into three main categories: core preference metrics, statistical confidence metrics, and standard error metrics.
The core preference metrics report how often each model’s outputs were preferred by the judge model. The `a_scores` metric counts the number of examples where Model A was favored, and `b_scores` counts cases where Model B was chosen as better. The `ties` metric captures instances in which the judge model rated both responses equally or couldn’t identify a clear preference. The `inference_error` metric counts cases where the judge couldn’t generate a valid judgment due to malformed data or internal errors.
The statistical confidence metrics quantify how likely it is that the observed preferences reflect true differences in model quality rather than random variation. The `winrate` reports the proportion of all valid comparisons in which Model B was preferred. The `lower_rate` and `upper_rate` define the lower and upper bounds of the 95% confidence interval for this win rate. For example, a `winrate` of 0.75 with a confidence interval between 0.60 and 0.85 suggests that, even accounting for uncertainty, Model B is consistently favored over Model A. The `score` field often matches the count of Model B wins but can also be customized for more complex evaluation strategies.
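To illustrate how a win rate and its interval relate to the raw counts, here is a minimal sketch using a normal approximation; the managed evaluation may use a different interval method, and the counts below are illustrative.

```python
import math

def winrate_with_ci(a_scores, b_scores, ties=0, z=1.96):
    """Win rate for Model B over all valid comparisons, with an
    approximate 95% confidence interval (normal approximation)."""
    n = a_scores + b_scores + ties
    p = b_scores / n
    se = math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - z * se), min(1.0, p + z * se))

# Illustrative counts only
winrate, (lower_rate, upper_rate) = winrate_with_ci(a_scores=40, b_scores=55, ties=5)
print(f"winrate={winrate:.2f}, 95% CI=({lower_rate:.2f}, {upper_rate:.2f})")
```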
The standard error metrics provide an estimate of the statistical uncertainty in each count. These include `a_scores_stderr`, `b_scores_stderr`, `ties_stderr`, `inference_error_stderr`, and `score_stderr`. Smaller standard error values indicate more reliable results. Larger values can point to a need for additional evaluation data or more consistent prompt engineering.
Interpreting these metrics requires attention to both the observed preferences and the confidence intervals:

- If the `winrate` is substantially above 0.5 and the confidence interval doesn’t include 0.5, Model B is statistically favored over Model A.
- If the `winrate` is below 0.5 and the confidence interval is fully below 0.5, Model A is preferred.
- A high `inference_error` count or large standard errors suggest there might have been issues in the evaluation process, such as inconsistencies in prompt formatting or insufficient sample size.

The following is an example metrics output from an evaluation run:
In this example, Model A was preferred 16 times, Model B was preferred 10 times, and there were no ties or inference errors. The `winrate` of 0.38 indicates that Model B was preferred in 38% of cases, with a 95% confidence interval ranging from 23% to 56%. Because the interval includes 0.5, this outcome suggests the evaluation was inconclusive, and additional data might be needed to clarify which model performs better overall.
These metrics, automatically generated as part of the evaluation process, provide a rigorous statistical foundation for comparing models and making data-driven decisions about which one to deploy.
This solution demonstrates how to evaluate generative AI models on Amazon SageMaker AI using the Nova LLM-as-a-Judge capability. The provided Python code guides you through the entire workflow.
First, it prepares a dataset by sampling questions from SQuAD and generating candidate responses from Qwen2.5 1.5B and Anthropic’s Claude 3.7 Sonnet. These outputs are saved in a JSONL file containing the prompt and both responses.
We accessed Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock using the `bedrock-runtime` client. We accessed Qwen2.5 1.5B using a SageMaker hosted Hugging Face endpoint.
Next, a PyTorch Estimator launches an evaluation job using an Amazon Nova LLM-as-a-Judge recipe. The job runs on GPU instances such as ml.g5.12xlarge and produces evaluation metrics, including win rates, confidence intervals, and preference counts. Results are saved to Amazon S3 for analysis.
Finally, a visualization function renders charts and tables, summarizing which model was preferred, how strong the preference was, and how reliable the estimates are. Through this end-to-end approach, you can assess improvements, track regressions, and make data-driven decisions about deploying generative models—all without manual annotation.
You need to complete the following prerequisites before you can run the notebook:

- An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess`, `AmazonS3FullAccess`, and `AmazonBedrockFullAccess` policies attached, to give the required access to SageMaker AI and Amazon Bedrock to run the examples.

Next, run the notebook `Amazon-Nova-LLM-as-a-Judge-Sagemaker-AI.ipynb` to start using the Amazon Nova LLM-as-a-Judge implementation on Amazon SageMaker AI.
To conduct an Amazon Nova LLM-as-a-Judge evaluation, you need to generate outputs from the candidate models you want to compare. In this project, we used two different approaches: deploying a Qwen2.5 1.5B model on Amazon SageMaker and invoking Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. First, we deployed Qwen2.5 1.5B, an open-weight multilingual language model, on a dedicated SageMaker endpoint. This was achieved by using the HuggingFaceModel deployment interface. To deploy the Qwen2.5 1.5B model, we provided a convenient script for you to invoke:

```
python3 deploy_sm_model.py
```
When it’s deployed, inference can be performed using a helper function wrapping the SageMaker predictor API:
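The following is a minimal sketch of such a helper, assuming a Hugging Face TGI-style endpoint that accepts an `inputs`/`parameters` payload; the endpoint name and generation parameters are placeholders.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def generate_with_qwen25(prompt, endpoint_name="qwen25-1-5b-endpoint",
                         max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Invoke the Qwen2.5 SageMaker endpoint and return the generated text."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }
    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    result = json.loads(response["Body"].read())
    # TGI-style endpoints return a list of {"generated_text": ...} objects
    return result[0]["generated_text"]
```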
In parallel, we integrated Anthropic’s Claude 3.7 Sonnet model in Amazon Bedrock. Amazon Bedrock provides a managed API layer for accessing proprietary foundation models (FMs) without managing infrastructure. The Claude generation function used the bedrock-runtime AWS SDK for Python (Boto3) client, which accepted a user prompt and returned the model’s text completion:
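A sketch of that function is shown below using the Bedrock Converse API; the model ID is an assumption and should be replaced with the Claude 3.7 Sonnet identifier or inference profile enabled in your account and Region.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Assumed model ID for Claude 3.7 Sonnet; verify the ID or inference
# profile available in your account and Region
CLAUDE_MODEL_ID = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

def generate_with_claude(prompt, max_tokens=256, temperature=0.7):
    """Send a prompt to Claude 3.7 Sonnet on Amazon Bedrock and return the completion text."""
    response = bedrock.converse(
        modelId=CLAUDE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": temperature},
    )
    return response["output"]["message"]["content"][0]["text"]
```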
When you have both functions generated and tested, you can move on to creating the evaluation data for the Nova LLM-as-a-Judge.
To create a realistic evaluation dataset for comparing the Qwen and Claude models, we used the Stanford Question Answering Dataset (SQuAD), a widely adopted benchmark in natural language understanding distributed under the CC BY-SA 4.0 license. SQuAD consists of thousands of crowd-sourced question-answer pairs covering a diverse range of Wikipedia articles. By sampling from this dataset, we made sure that our evaluation prompts reflected high-quality, factual question-answering tasks representative of real-world applications.
We began by loading a small subset of examples to keep the workflow fast and reproducible. Specifically, we used the Hugging Face `datasets` library to download and load the first 20 examples from the SQuAD training split:
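A sketch of that call, assuming the dataset is available under the `squad` identifier on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load only the first 20 examples of the SQuAD training split
squad = load_dataset("squad", split="train[:20]")
```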
This command retrieves a slice of the full dataset, containing 20 entries with structured fields including context, question, and answers. To verify the contents and inspect an example, we printed out a sample question and its ground truth answer:
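For example, a quick check along these lines prints the first question and its reference answer:

```python
print("Question:", squad[0]["question"])
# SQuAD stores answers as {"text": [...], "answer_start": [...]}
print("Answer:  ", squad[0]["answers"]["text"][0])
```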
For the evaluation set, we selected the first six questions from this subset:
```python
questions = [squad[i]["question"] for i in range(6)]
```
After preparing a set of evaluation questions from SQuAD, we generated outputs from both models and assembled them into a structured dataset to be used by the Amazon Nova LLM-as-a-Judge workflow. This dataset serves as the core input for SageMaker AI evaluation recipes. To do this, we iterated over each question prompt and invoked the two generation functions defined earlier:
- `generate_with_qwen25()` for completions from the Qwen2.5 model deployed on SageMaker
- `generate_with_claude()` for completions from Anthropic’s Claude 3.7 Sonnet in Amazon Bedrock

For each prompt, the workflow attempted to generate a response from each model. If a generation call failed due to an API error, timeout, or other issue, the system captured the exception and stored a clear error message indicating the failure. This made sure that the evaluation process could proceed gracefully even in the presence of transient errors:
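The loop below sketches this step, assuming the two helper functions shown earlier and the illustrative `prompt`/`response_A`/`response_B` field names; adjust the schema to match the recipe's expected format.

```python
import json

records = []
for question in questions:
    record = {"prompt": question}
    try:
        record["response_A"] = generate_with_qwen25(question)
    except Exception as e:
        record["response_A"] = f"ERROR: Qwen2.5 generation failed: {e}"
    try:
        record["response_B"] = generate_with_claude(question)
    except Exception as e:
        record["response_B"] = f"ERROR: Claude generation failed: {e}"
    records.append(record)

# Write one JSON object per line (JSON Lines format)
with open("llm_judge.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```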
This workflow produced a JSON Lines file named `llm_judge.jsonl`. Each line contains a single evaluation record with the prompt and the two candidate responses, following the structure shown earlier.

Then, upload this `llm_judge.jsonl` to an S3 bucket that you’ve predefined:
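One way to do this is with the Boto3 S3 client; the bucket name and prefix below are placeholders for your own locations, and `evalInput` is the S3 URI passed to the training job later.

```python
import boto3

s3 = boto3.client("s3")
bucket = "your-evaluation-bucket"        # placeholder bucket name
key = "nova-llm-judge/llm_judge.jsonl"   # placeholder prefix

s3.upload_file("llm_judge.jsonl", bucket, key)

# S3 location used as the training input channel for the evaluation job
evalInput = f"s3://{bucket}/nova-llm-judge/"
```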
After preparing the dataset and creating the evaluation recipe, the final step is to launch the SageMaker training job that performs the Amazon Nova LLM-as-a-Judge evaluation. In this workflow, the training job acts as a fully managed, self-contained process that loads the model, processes the dataset, and generates evaluation metrics in your designated Amazon S3 location.
We use the `PyTorch` estimator class from the SageMaker Python SDK to encapsulate the configuration for the evaluation run. The estimator defines the compute resources, the container image, the evaluation recipe, and the output paths for storing results:
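The following configuration is a sketch under the assumption that the evaluation recipe is passed to the estimator through a `training_recipe` argument, as in other SageMaker recipe workflows; verify the exact arguments, recipe name, and container handling against the current SageMaker Python SDK and Nova recipe documentation.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder output location; reuses the bucket defined earlier
output_s3_uri = f"s3://{bucket}/nova-llm-judge/output/"

estimator = PyTorch(
    base_job_name="nova-llm-judge-eval",
    role=role,
    sagemaker_session=sess,
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    output_path=output_s3_uri,
    # Assumed argument: reference to your Nova LLM-as-a-Judge evaluation
    # recipe; replace with the recipe name or path you are using
    training_recipe="my-nova-llm-judge-recipe.yaml",
)
```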
When the estimator is configured, you initiate the evaluation job using the `fit()` method. This call submits the job to the SageMaker control plane, provisions the compute cluster, and begins processing the evaluation dataset:

```python
estimator.fit(inputs={"train": evalInput})
```
The following graphic illustrates the results of the Amazon Nova LLM-as-a-Judge evaluation job.
To help practitioners quickly interpret the outcome of a Nova LLM-as-a-Judge evaluation, we created a convenience function that produces a single, comprehensive visualization summarizing key metrics. This function, `plot_nova_judge_results`, uses Matplotlib and Seaborn to render an image with six panels, each highlighting a different perspective of the evaluation outcome.
This function takes the evaluation metrics dictionary—produced when the evaluation job is complete—and generates the following visual components:
Because the function outputs a standard Matplotlib figure, you can quickly save the image, display it in Jupyter notebooks, or embed it in other documentation.
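The following is a simplified sketch of this kind of visualization, reduced to two panels (preference counts, and the Model B win rate with its confidence interval) rather than the full six-panel figure produced by `plot_nova_judge_results`.

```python
import matplotlib.pyplot as plt

def plot_judge_summary(metrics):
    """Minimal two-panel summary: preference counts and win rate with 95% CI."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Panel 1: raw preference counts
    labels = ["Model A", "Model B", "Ties"]
    counts = [metrics["a_scores"], metrics["b_scores"], metrics["ties"]]
    ax1.bar(labels, counts)
    ax1.set_title("Preference counts")
    ax1.set_ylabel("Number of comparisons")

    # Panel 2: win rate with its confidence interval
    winrate = metrics["winrate"]
    err_low = winrate - metrics["lower_rate"]
    err_high = metrics["upper_rate"] - winrate
    ax2.errorbar([0], [winrate], yerr=[[err_low], [err_high]], fmt="o", capsize=6)
    ax2.axhline(0.5, linestyle="--", color="gray", label="No preference (0.5)")
    ax2.set_xlim(-1, 1)
    ax2.set_ylim(0, 1)
    ax2.set_xticks([])
    ax2.set_title("Model B win rate (95% CI)")
    ax2.legend()

    fig.tight_layout()
    return fig
```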
Complete the following steps to clean up your resources:
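As a minimal sketch, the following removes the Qwen2.5 endpoint and its endpoint configuration to stop ongoing charges; the resource names are placeholders, and any S3 artifacts or models you no longer need can be deleted separately.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names; use the names created during deployment
endpoint_name = "qwen25-1-5b-endpoint"

sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
```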
The Amazon Nova LLM-as-a-Judge workflow offers a reliable, repeatable way to compare two language models on your own data. You can integrate this into model selection pipelines to decide which version performs best, or you can schedule it as part of continuous evaluation to catch regressions over time.
For teams building agentic or domain-specific systems, this approach provides richer insight than automated metrics alone. Because the entire process runs on SageMaker training jobs, it scales quickly and produces clear visual reports that can be shared with stakeholders.
This post demonstrates how Nova LLM-as-a-Judge—a specialized evaluation model available through Amazon SageMaker AI—can be used to systematically measure the relative performance of generative AI systems. The walkthrough shows how to prepare evaluation datasets, launch SageMaker AI training jobs with Nova LLM-as-a-Judge recipes, and interpret the resulting metrics, including win rates and preference distributions. The fully managed SageMaker AI solution simplifies this process, so you can run scalable, repeatable model evaluations that align with human preferences.
We recommend starting your LLM evaluation journey by exploring the official Amazon Nova documentation and examples. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.