Large language models (LLMs) now drive the most advanced conversational agents, creative tools, and decision-support systems. However, their raw output often contains inaccuracies, policy misalignments, or unhelpful phrasing—issues that undermine trust and limit real-world utility. Reinforcement Fine‑Tuning (RFT) has emerged as the preferred method to align these models efficiently, using automated reward signals to replace costly manual labeling.
At the heart of modern RFT are reward functions. They are built for each domain either as verifiable reward functions that score LLM generations programmatically (Reinforcement Learning with Verifiable Rewards, or RLVR) or with LLM-as-a-judge, where a separate language model evaluates candidate responses to guide alignment (Reinforcement Learning from AI Feedback, or RLAIF). Both methods provide scores to the RL algorithm that nudge the model toward solving the problem at hand. In this post, we take a deeper look at how to use RLAIF, or RL with LLM-as-a-judge, effectively with Amazon Nova models.
Reinforcement Fine-Tuning can use any reward signal: straightforward hand-crafted rules (RLVR) or an LLM that evaluates model outputs (LLM-as-a-judge, or RLAIF). RLAIF makes alignment far more flexible and powerful, especially when reward signals are vague and hard to craft manually. Unlike generic RFT rewards that rely on blunt numeric scoring such as substring matching, an LLM judge reasons across multiple dimensions (correctness, tone, safety, relevance), providing context-aware feedback that captures subtleties and domain-specific nuances without task-specific retraining. LLM judges also offer built-in explainability through rationales (for example, "Response A cites peer-reviewed studies"), providing diagnostics that accelerate iteration, pinpoint failure modes directly, and reduce hidden misalignments, something static reward functions can't do.
This section covers the key steps involved in designing and deploying LLM-as-a-judge reward functions.
The first critical decision is selecting your judge architecture. LLM-as-a-judge offers two primary evaluation modes, rubric-based (point-based) judging and preference-based judging, each suited to different alignment scenarios.
| Criteria | Rubric-based judging | Preference-based judging |
| --- | --- | --- |
| Evaluation method | Assigns a numeric score to a single response using predefined criteria | Compares two candidate responses side-by-side and selects the superior one |
| Quality measurement | Absolute quality measurement | Relative quality through direct comparison |
| Preferred when | Clear, quantifiable evaluation dimensions exist (accuracy, completeness, safety compliance) | The policy model should explore freely without reference data restrictions |
| Data requirements | Only requires careful prompt engineering to align the judge to reward specifications | Requires at least one reference response for preference comparison |
| Generalizability | Better for out-of-distribution data; avoids data bias | Depends on the quality of reference responses |
| Evaluation style | Mirrors absolute scoring systems | Mirrors natural human evaluation through comparison |
| Recommended starting point | Start here if preference data is unavailable and RLVR is unsuitable | Use when comparative data is available |
After you’ve selected your judge type, articulate the specific dimensions that you want to improve. Clear evaluation criteria are the foundation of effective RLAIF training.
For Preference-based judges:
Write clear prompts explaining what makes one response better than another. Be explicit about quality preferences with concrete examples. Example: “Prefer responses that cite authoritative sources, use accessible language, and directly address the user’s question.”
For Rubric-based judges:
We recommend using Boolean (pass/fail) scoring for rubric-based judges. Boolean scoring is more reliable and reduces judge variability compared to fine-grained 1–10 scales. Define clear pass/fail criteria for each evaluation dimension with specific, observable characteristics.
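As an illustration, Boolean verdicts can be aggregated into a single scalar reward. This is a minimal sketch; the criterion names and weights below are hypothetical assumptions, not part of any Amazon Nova API:

```python
# Minimal sketch of Boolean rubric aggregation. The criteria and weights
# are illustrative assumptions chosen for this example.
RUBRIC = {
    # criterion -> weight; each criterion is judged pass (True) or fail (False)
    "cites_authoritative_sources": 0.3,
    "directly_answers_question": 0.4,
    "no_unsafe_content": 0.3,
}

def rubric_reward(judgments: dict) -> float:
    """Combine per-criterion pass/fail judgments into a reward in [0, 1].

    `judgments` maps each rubric criterion to the judge's Boolean verdict;
    missing criteria count as failures.
    """
    return sum(weight for name, weight in RUBRIC.items() if judgments.get(name, False))

# Example: the judge passed two of the three criteria.
score = rubric_reward({
    "cites_authoritative_sources": True,
    "directly_answers_question": True,
    "no_unsafe_content": False,
})
```

Because each criterion is a binary decision, two runs of the judge are far more likely to agree than with a 1–10 scale, which is what makes this style of scoring more stable during training.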
Choose an LLM with sufficient reasoning capability to evaluate your target domain, configured through Amazon Bedrock and invoked from a reward AWS Lambda function. For common domains such as math, coding, and conversational tasks, smaller models can work well with careful prompt engineering.
| Model tier | Preferred for | Cost | Reliability | Amazon Bedrock model |
| --- | --- | --- | --- | --- |
| Large/heavyweight | Complex reasoning, nuanced evaluation, multi-dimensional scoring | High | Very high | Amazon Nova Pro, Claude Opus, Claude Sonnet |
| Medium/lightweight | General domains like math or coding, balanced cost-performance | Low-medium | Moderate-high | Amazon Nova 2 Lite, Claude Haiku |
Your judge prompt is the foundation of alignment quality. Design it to produce structured, parseable outputs with clear scoring dimensions:
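As a sketch of what "structured, parseable output" can look like in practice, the following shows a hypothetical judge prompt template and a defensive parser. The `<score>` tags, dimension names, and event shape are illustrative assumptions, not a prescribed format:

```python
import json
import re

# Hypothetical judge prompt; the tags and scoring dimensions are
# illustrative, not part of the original post.
JUDGE_PROMPT = """You are an expert evaluator. Score the response below.

Question: {question}
Response: {response}

For each dimension, answer PASS or FAIL, then output a JSON object inside
<score></score> tags, for example:
<score>{{"correctness": "PASS", "tone": "FAIL", "safety": "PASS"}}</score>
"""

def parse_judge_output(text: str) -> dict:
    """Extract the structured score from the judge's raw completion.

    Returns an empty dict when the output cannot be parsed, so the caller
    can assign a neutral fallback reward instead of crashing training.
    """
    match = re.search(r"<score>(.*?)</score>", text, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return {}

parsed = parse_judge_output(
    'Reasoning... <score>{"correctness": "PASS", "tone": "PASS", "safety": "PASS"}</score>'
)
```

Wrapping the machine-readable portion in explicit tags lets the judge reason freely before the tags while keeping extraction deterministic.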
Your reward function should mirror the metrics that you will use to evaluate the final model in production. Aligning your reward function with production success criteria ensures the model is optimized for the correct objectives.
Alignment workflow:
Production RFT systems process thousands of reward evaluations per training step. This section covers how to build a reward Lambda function that is resilient, efficient, and production ready, supporting training stability, efficient compute usage, and reliable model behavior.
Composite reward score structuring
Don’t rely solely on LLM judges. Combine them with fast, deterministic reward components that catch obvious failures before invoking expensive judge evaluations:
Core components
| Component | Purpose | When to use |
| --- | --- | --- |
| Format correctness | Verify JSON structure, required fields, schema compliance | Always – catches malformed outputs immediately with cheap, instant feedback |
| Length penalties | Discourage overly verbose or terse responses | When output length matters (for example, summaries) |
| Language consistency | Verify responses match the input language | Critical for multilingual applications |
| Safety filters | Rule-based checks for prohibited content | Always – prevents unsafe content from reaching production |
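The components above can be combined as cheap gates in front of the judge. This is a minimal sketch under stated assumptions: `judge_score_fn` stands in for the Bedrock judge call and is assumed to return a score in [0, 1]; the length threshold is illustrative:

```python
import json

def composite_reward(response: str, judge_score_fn) -> float:
    """Sketch of a composite reward: cheap deterministic gates first,
    then the (expensive) LLM judge only for responses that pass them.
    """
    # 1. Format correctness: reject malformed JSON immediately.
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    # 2. Length penalty: discourage extreme verbosity (threshold is illustrative).
    if len(response) > 8000:
        return 0.1
    # 3. Language-consistency and safety filters would slot in here as
    #    additional rule-based gates (omitted for brevity).
    # 4. Only now pay for the LLM judge evaluation.
    return judge_score_fn(response)

reward = composite_reward('{"comment": "ok"}', lambda r: 0.9)
```

Ordering the gates from cheapest to most expensive means malformed rollouts never consume judge-model tokens.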
Infrastructure readiness
Test your reward Lambda function for resilience
Validate judge consistency and calibration:
The following diagram illustrates the complete end-to-end training process, from baseline evaluation through judge validation to production deployment. Each step builds upon the previous one, creating a resilient pipeline that balances alignment quality with computational efficiency while actively preventing reward hacking and supporting production-ready model behavior.
In this section, we refer to a real-world use case with a leading legal industry partner. The task is to generate comments on risks, assessments, and actions on legal documentation with respect to the policies and previous contracts as reference documents.
The partner was interested in automating the process of reviewing, assessing, and flagging risks in legal contract documents. Specifically, they wanted to evaluate potential new contracts against internal guidelines and regulations, past contracts, and the laws of the country governing the contract.
We formulated this as a problem where we provide a target document (the “contract” that needs evaluation) and a reference document (the grounding document and context), and expect the LLM to generate a JSON object with multiple comments, comment types, and recommended actions based on the assessment. The original dataset available for this use case was relatively small; it included complete contracts along with annotations and comments from legal experts. We used LLM-as-a-judge with the GPT OSS 120B model as the judge and a custom system prompt during RFT.
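For illustration, the expected model output might look like the following. The field names here are hypothetical; the partner's actual schema is not shown in this post:

```json
{
  "comments": [
    {
      "comment": "Clause 4.2 conflicts with the internal data-retention policy.",
      "comment_type": "risk",
      "recommended_action": "Revise the retention period to match policy guidelines."
    }
  ]
}
```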
In the following sections, we cover the key aspects of the RFT workflow for this use case.
The following code snippets present the key components of the reward Lambda function.
Note: The reward Lambda function’s name must contain “SageMaker”, for example, "arn:aws:lambda:us-east-1:123456789012:function:MyRewardFunctionSageMaker"
a) Start with defining a high-level objective
b) Define the evaluation approach
c) Describe the scoring dimensions with clear specifications on how a particular score should be calculated
d) Clearly define the final output format to parse
e) Create a high-level Lambda handler, providing sufficient multithreading for faster inference
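The handler in step (e) can be sketched as follows. In production, `judge_fn` would wrap a Bedrock judge-model invocation; here it is injected as a plain callable so the sketch stays self-contained, and the event shape (a list of completions under `"responses"`) is an illustrative assumption:

```python
from concurrent.futures import ThreadPoolExecutor

def score_one(judge_fn, response: str) -> float:
    """Score a single rollout, falling back to 0.0 on judge failure."""
    try:
        return float(judge_fn(response))
    except Exception:
        # Never crash the training step because one judge call failed.
        return 0.0

def lambda_handler(event, context, judge_fn=lambda r: 0.5, max_workers=8):
    """Score a batch of rollouts concurrently and return their rewards."""
    responses = event.get("responses", [])
    # Judge rollouts in parallel: judge-model calls are I/O-bound, so
    # threads give near-linear speedup up to the model's rate limits.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(lambda r: score_one(judge_fn, r), responses))
    return {"rewards": scores}

result = lambda_handler(
    {"responses": ["a", "b"]}, None,
    judge_fn=lambda r: 1.0 if r == "a" else 0.0,
)
```

The per-rollout try/except is what keeps one failed or malformed judge call from failing the whole batch, which matters when thousands of evaluations run per training step.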
We used the following AWS Identity and Access Management (IAM) permissions and settings in the Lambda function. The following configurations are required for reward Lambda functions. RFT training can fail if any of them are missing.
a) Permissions for Amazon SageMaker AI execution role
Your Amazon SageMaker AI execution role must have permission to invoke your Lambda function. Add this policy to your Amazon SageMaker AI execution role:
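A minimal policy statement for this looks like the following; the account ID and function name are placeholders matching the example ARN shown earlier, and you should substitute your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyRewardFunctionSageMaker"
    }
  ]
}
```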
b) Permissions for Lambda function execution role
Your Lambda function’s execution role needs basic Lambda execution permissions and the permissions to Invoke the judge Amazon Bedrock model.
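A sketch of the Lambda execution role policy follows. Per the least-privilege guidance below, you should scope the `Resource` entries to your log group and your specific judge model ARN rather than the wildcards shown here:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": "bedrock:InvokeModel",
      "Resource": "arn:aws:bedrock:*::foundation-model/*"
    }
  ]
}
```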
Note: This solution follows the AWS shared responsibility model. AWS is responsible for securing the infrastructure that runs AWS services in the cloud. You are responsible for securing your Lambda function code, configuring IAM permissions, implementing encryption and access controls, managing data security and privacy, configuring monitoring and logging, and verifying compliance with applicable regulations. Follow the principle of least privilege by scoping permissions to specific resource ARNs. For more information, see Security in AWS Lambda and Amazon SageMaker AI Security in the AWS documentation.
c) Add provisioned concurrency
Publish a version of the Lambda function, then add provisioned concurrency so the function scales without latency fluctuations. In our case, 100 provisioned concurrent executions were sufficient, though there is room for further cost optimization.
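With the AWS CLI, this can be done roughly as follows. The function name is a placeholder, and the qualifier must match the version number returned by `publish-version`:

```shell
# Publish an immutable version, then attach provisioned concurrency to it.
aws lambda publish-version --function-name MyRewardFunctionSageMaker
aws lambda put-provisioned-concurrency-config \
  --function-name MyRewardFunctionSageMaker \
  --qualifier 1 \
  --provisioned-concurrent-executions 100
```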
d) Set the Lambda timeout to 15 minutes
We launched the Nova Forge SDK, which can be used for the entire model customization lifecycle, from data preparation to deployment and monitoring. The Nova Forge SDK removes the need to search for the appropriate recipes or container URI for specific techniques.
You can use the Nova Forge SDK to customize training parameters in two ways: provide a full recipe YAML using recipe_path or pass specific fields using overrides for selective changes. For this use case, we use overrides to tune the rollout and trainer settings as shown in the following section.
RFT with Amazon Nova 2 Lite achieved a 4.33 aggregate score—the highest performance across all evaluated models—while maintaining perfect JSON schema validation. This represents a significant improvement, demonstrating that RFT can produce production-ready, specialized models that outperform larger general-purpose alternatives.
We evaluated models using a “best of k” single-comment setting, where each model generated multiple comments per sample and we scored the highest-quality output. This approach establishes an upper bound on performance and enables a fair comparison between models that produce single versus multiple outputs.
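The "best of k" evaluation described above amounts to taking the maximum judge score over a model's k candidates per sample. A minimal sketch, where `judge_fn` stands in for the LLM judge:

```python
def best_of_k_score(candidates, judge_fn) -> float:
    """Return the best judge score across k candidate outputs for one sample.

    Models that emit a single output simply pass a one-element list, which
    is what makes the comparison fair across single- and multi-output models.
    """
    return max(judge_fn(c) for c in candidates)

# Toy example with an illustrative length-based judge.
score = best_of_k_score(["short", "a longer candidate"], judge_fn=lambda c: len(c) / 20)
```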
Figure 1 — JSON Schema Validation Scores (0–1 scale, higher is better)
Figure 2 — Aggregate LLM judge scores (1–5 scale, higher is better)
Amazon Nova 2 Lite with RFT achieved a 4.33 aggregate score, outperforming both Claude Sonnet 4.5 and Claude Haiku 4.5, while also achieving perfect JSON schema validation.
During SFT iterations, we observed problematic behaviors, including repetitive comment generation and unnatural Unicode character predictions. These issues, likely caused by overfitting or dataset imbalances, didn’t appear in RFT checkpoints. RFT’s reward-based training naturally discourages such artifacts, producing more robust and reliable outputs.
When we evaluated RFT models using a modified judge prompt (aligned but not identical to the training reward function), performance remained strong. This demonstrates that RFT learns generalizable quality patterns rather than overfitting specific evaluation criteria. This is a critical advantage for real-world deployment where requirements evolve.
RFT required 4–8 rollouts per training sample, increasing compute costs compared to SFT. This overhead is amplified when using non-zero reasoning effort settings. However, for mission-critical applications where alignment quality directly impacts business outcomes, such as legal contract review, financial compliance, or healthcare documentation, the performance gains justify the additional compute cost.
Reinforcement Fine-Tuning (RFT) with LLM-as-a-judge represents a powerful approach to aligning LLMs for domain-specific applications. As demonstrated in our legal contract review case study, this methodology delivers significant improvements over both base models and traditional supervised fine-tuning (SFT) approaches, with RFT achieving the highest aggregate scores across all evaluation dimensions. For teams building mission-critical AI systems where alignment quality directly impacts business outcomes, RFT with LLM-as-a-judge offers a compelling path forward. The methodology’s explainability, flexibility, and superior performance make it particularly valuable for complex domains like legal review, financial services, and healthcare, where subtle nuances matter.
Organizations considering this approach should start small—validate their judge design on curated benchmarks, verify infrastructure resilience, and scale gradually while monitoring for reward hacking. With proper implementation, RFT can transform capable base models into highly specialized, production-ready systems that consistently deliver aligned, trustworthy outputs.
Disclaimer:
The legal contract review use case described in this post is for technical demonstration purposes only. AI-generated contract analysis is not a substitute for professional legal advice. Consult qualified legal counsel for legal matters.