Your AI agent worked in the demo, impressed stakeholders, handled test scenarios, and seemed ready for production. Then you deployed it, and the picture changed. Real users experienced wrong tool calls, inconsistent responses, and failure modes nobody anticipated during testing.
The result is a gap between expected agent behavior and actual user experience in production. Agent evaluation introduces challenges that traditional software testing wasn’t designed to handle. Because large language models (LLMs) are non-deterministic, the same user query can produce different tool selections, reasoning paths, and outputs across multiple runs. This means that you must test each scenario repeatedly to understand your agent’s actual behavior patterns. A single test pass tells you what can happen, not what typically happens. Without systematic measurement across these variations, teams are trapped in cycles of manual testing and reactive debugging. This burns through API costs without clear insight into whether changes improve agent performance. This uncertainty makes every prompt modification risky and leaves a fundamental question unanswered: “Is this agent actually better now?”
In this post, we introduce Amazon Bedrock AgentCore Evaluations, a fully managed service for assessing AI agent performance across the development lifecycle. We walk through how the service measures agent accuracy across multiple quality dimensions. We explain the two evaluation approaches for development and production and share practical guidance for building agents you can deploy with confidence.
When a user sends a request to an agent, multiple decisions happen in sequence. The agent determines which tools (if any) to call, executes those calls, and generates a response based on the results. Each step introduces potential failure points: selecting the wrong tool, calling the right tool with incorrect parameters, or synthesizing tool outputs into an inaccurate final answer. Unlike traditional applications where you test a single function’s output, agent evaluation requires measuring quality across this entire interaction flow.
This creates specific challenges for agent developers. Addressing them starts with defining what success looks like for your agent: what a correct response is, which quality dimensions matter for your use case, and what scores count as acceptable.

Each of these definitions directly determines what your evaluation system measures, and getting them wrong means optimizing for the wrong outcomes. Without this foundational work, the gap between what teams hope their agents do and what they can prove their agents do becomes a real business risk. Bridging this gap requires a continuous evaluation cycle, as shown in Figure 1. Teams build test cases, run them against the agent, score the results, analyze failures, and implement improvements. Each failure becomes a new test case, and the cycle continues through every iteration of the agent.
Figure 1: The agent evaluation process follows a continuous cycle of test cases, agent execution, scoring, analysis, and improvements. Failures become new test cases.
Running this cycle end to end, however, requires significant infrastructure beyond the evaluation logic itself. Teams must curate datasets, select and host scoring models, manage inference capacity and API rate limits, build data pipelines that transform agent traces into evaluation-ready formats, and create dashboards to visualize trends. For organizations running multiple agents, this overhead multiplies with each one. The result is that agent developer teams end up spending more time maintaining evaluation tooling than acting on what it tells them. This is the problem Amazon Bedrock AgentCore Evaluations was built to address.
First launched in public preview at AWS re:Invent 2025, the service is now generally available. It handles the evaluation models, inference infrastructure, data pipelines, and scaling so teams can focus on improving agent quality rather than building and maintaining evaluation systems. For built-in evaluators, model quota and inference capacity are fully managed. This means that organizations evaluating many agents aren’t consuming their own quotas or provisioning separate infrastructure for evaluation workloads.
AgentCore Evaluations examines agent behavior end-to-end using OpenTelemetry (OTEL) traces with generative AI semantic conventions. OTEL is an open source observability standard for collecting distributed traces from applications. The generative AI semantic conventions extend it with fields specific to language model interactions, including prompts, completions, tool calls, and model parameters. By building on this standard, the service works consistently across agents built with any framework, such as Strands Agents or LangGraph, that is instrumented with OpenTelemetry or OpenInference, capturing the full context needed for meaningful evaluation.
The evaluations can be configured with different approaches:
In the LLM-as-a-Judge approach, the Judge model examines the full interaction context, including conversation history, available tools, tools used, parameters passed, and system instructions, then provides detailed reasoning before assigning a score. Every score comes with an explanation. Teams can use these scores to verify judgments, understand exactly why an interaction received a particular rating, and identify what should have happened differently. This approach goes beyond simple pass/fail judgments, providing the structured evaluation and transparent reasoning that enable quality assessment at a scale that manual review cannot match.
Three principles guide how the service approaches evaluation. Evidence-driven development replaces intuition with quantitative metrics, so teams can measure the actual impact of changes rather than debating whether a prompt modification “feels better.” Multi-dimensional assessment evaluates different aspects of agent behavior independently. This makes it possible to pinpoint exactly where improvements are needed rather than relying on a single aggregate score. Continuous measurement connects the performance baselines established during development directly to production monitoring, making sure that quality holds up as real-world conditions evolve. These principles apply throughout the agent lifecycle, from the first round of development testing through ongoing production monitoring.
An agent’s journey from prototype to production creates two distinct evaluation needs. During development, teams need controlled environments where they can compare alternatives, test the agent on curated datasets, reproduce results, and validate changes before they reach users. After the agent is live, the challenge shifts to monitoring real-world interactions at scale, where users encounter edge cases and interaction patterns that no amount of pre-deployment testing anticipated. Figure 2 illustrates how evaluation supports each stage of this journey, from initial proof of concept through shadow testing, A/B testing, and continuous production monitoring.
Figure 2: From POC to production, evaluation validates agents before deployment. As agents mature, evaluation supports shadow testing, A/B testing, and continuous monitoring at scale.
AgentCore Evaluations map two complementary approaches to these lifecycle phases, as shown in Figure 3. Online evaluation handles continuous production monitoring, while on-demand evaluation supports controlled testing during development and continuous integration and continuous delivery (CI/CD) workflows, including evaluations against ground truth.
| | On-demand Evaluation | Online Evaluation |
| --- | --- | --- |
| Advantages | Controlled, reproducible testing; supports evaluation against ground truth | Continuous monitoring of real production traffic without code changes or redeployments |
| Use cases | Development testing, CI/CD pipelines, regression testing, deployment gates | Production quality monitoring, alerting when scores drop below thresholds |
Figure 3: Online evaluation monitors production traffic continuously, while on-demand evaluation supports controlled testing during development.
Online evaluation monitors live agent interactions by continuously sampling a configurable percentage of traces and scoring them against your chosen evaluators. You define which evaluators to apply, set sampling rules that control what fraction of production traffic gets evaluated, and set up appropriate filters. The service handles reading traces, running evaluations, and surfacing results in the AgentCore Observability dashboard powered by Amazon CloudWatch. If you’re already collecting traces for observability, online evaluation adds quality scores with explanation, alongside your existing operational metrics without requiring code changes or re-deployments. Figure 4 shows how this process works.
Quality issues in production often surface in ways that traditional monitoring misses. Operational dashboards may show green across latency and error rates while user experience quietly degrades because the agent starts selecting wrong tools or providing less helpful responses. Continuous quality scoring catches these silent failures by tracking evaluation metrics alongside operational ones. Because AgentCore Observability runs on CloudWatch, you can create custom dashboards and set alarms to get alerted the moment scores drop below your thresholds.
On-demand evaluation is a real-time API designed for development and CI/CD workflows. Teams use it to test changes before deployment, run evaluation suites as part of CI/CD pipelines, perform regression testing across builds, and gate deployments on quality thresholds. Developers select a full session, or specify exact spans (individual operations within a trace) or traces by providing their IDs. The service considers the full session conversation and scores individual spans or traces against the same evaluators used in production. Common use cases include validating prompt changes, comparing model performance across alternatives, and preventing quality regressions.
Figure 5: On-demand evaluation enables developers to prepare trace datasets, invoke evaluations through a CI/CD pipeline or development environment, and receive scores using built-in or custom evaluators powered by Amazon Bedrock foundation models.
Because both modes use the same evaluators, what you test in CI/CD is what you monitor in production, giving you consistent quality standards across the entire development lifecycle. On-demand evaluation provides the controlled environment needed for architecture decisions and systematic improvement, while online evaluation ensures quality monitoring continues after the agent is live. Together, the two modes form a continuous feedback loop between development and production, and both draw from the same set of evaluators and scoring infrastructure.
AgentCore Evaluations organizes agent interactions into a three-level hierarchy that determines what can be evaluated and at what granularity. A session represents a complete conversation between a user and your agent, grouping all related interactions from a single user or workflow. Within each session, a trace captures everything that happens during a single exchange. When a user sends a message and receives a response, that round trip produces one trace containing every step that the agent took to generate its answer. Each trace in turn contains individual operations called spans, representing specific actions your agent performed, such as invoking a tool, retrieving information from a knowledge base, or generating text.
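The session → trace → span hierarchy can be sketched as a simple data model. The field names below are illustrative, not the service's wire format (which is OTEL-based), but they show how the three levels nest:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """A single operation the agent performed within one exchange."""
    name: str   # e.g., "tool.get_weather" or "llm.generate" (hypothetical names)
    kind: str   # "tool_call", "retrieval", or "generation"

@dataclass
class Trace:
    """One user message → agent response round trip."""
    user_message: str
    agent_response: str
    spans: list[Span] = field(default_factory=list)

@dataclass
class Session:
    """A complete conversation grouping all related traces."""
    session_id: str
    traces: list[Trace] = field(default_factory=list)

session = Session(
    session_id="demo-1",
    traces=[
        Trace(
            user_message="What's the weather in Seattle?",
            agent_response="It's 55°F and raining.",
            spans=[
                Span(name="tool.get_weather", kind="tool_call"),
                Span(name="llm.generate", kind="generation"),
            ],
        )
    ],
)
```

Session-level evaluators see the whole `Session`, trace-level evaluators see one `Trace` plus prior turns, and tool-level evaluators focus on individual tool-call `Span` objects.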
Different evaluators operate at different levels of this hierarchy, and problems at one level can look very different from problems at another. The service provides 13 pre-configured built-in evaluators organized across these three levels, each measuring a distinct aspect of agent behavior (Figure 6). You can also define custom LLM-as-a-Judge evaluators and custom code evaluators that work at the session, trace, and span levels.
| Level | Evaluators | Purpose | Ground Truth Use |
| --- | --- | --- | --- |
| Session | Goal Success Rate | Assesses whether all user goals were completed within a conversation | User provides free-form textual assertions of goal completion, which are compared against system behavior and measured via Goal Success Rate |
| Trace | Helpfulness, Correctness, Coherence, Conciseness, Faithfulness, Harmfulness, Instruction Following, Response Relevance, Context Relevance, Refusal, Stereotyping | Evaluates response quality, accuracy, safety, and communication effectiveness | Turn-level ground truth (e.g., expected answer or attributes per turn) supports evaluation of Correctness |
| Tool | Tool Selection Accuracy, Tool Parameter Accuracy | Assesses tool selection decisions and parameter extraction precision | Tool call ground truth specifies the correct tool sequence, enabling Trajectory Exact Order Match, Trajectory In-Order Match, and Trajectory Any Order Match |
Figure 6: Built-in evaluators operate at session, trace, and tool levels. Each level measures different aspects of agent behavior. Ground Truth can be provided as assertions, expected response and expected trajectory for evaluation on session, trace and tool level.
Evaluating each level independently helps teams diagnose whether a problem originates in tool selection, response generation, or session-level planning. An agent might choose the right tool with accurate parameters but then synthesize the tool's output poorly in its final response; this pattern only becomes visible when each level is assessed on its own.

Your agent's primary purpose guides which evaluators to prioritize. Customer service agents should focus on Helpfulness, Goal Success Rate, and Instruction Following, since resolving user issues within defined guardrails directly impacts satisfaction. Agents with Retrieval Augmented Generation (RAG) components benefit most from Correctness and Faithfulness, which verify that responses are grounded in the provided context. Tool-heavy agents need strong Tool Selection Accuracy and Tool Parameter Accuracy scores. Start with three or four evaluators that align with your agent's purpose and expand coverage as your understanding matures.
Some evaluators naturally interact with each other, so scores should be read together rather than in isolation. Evaluators that sound similar often measure fundamentally different things, and understanding these distinctions is important for diagnosis. For instance, Correctness measures agreement with a known-correct answer, while Faithfulness measures grounding in the provided context; a response can be faithful to an incorrect document, or correct while ignoring the context entirely. Evaluators can also trade off against each other: aggressively optimizing for Conciseness can shorten responses past the point where Helpfulness suffers.
Built-in evaluators ship with predefined prompt templates, selected evaluator models, and standardized scoring criteria, with configurations fixed to preserve consistency across evaluations. They use cross-Region inference to automatically select compute from AWS Regions within your geography, improving model availability and throughput while keeping data stored in the originating Region. Custom evaluators extend this foundation with support for your own evaluator model, evaluation instructions, criteria, and scoring schema. They're particularly valuable for industry-specific assessments such as compliance checking in healthcare or financial services, brand voice consistency verification, or enforcing organizational quality standards. Custom code evaluators let you bring your own AWS Lambda function to perform the evaluations, which also enables deterministic scoring of your agents.
For use cases requiring all processing within a single Region, custom evaluators also provide full control over inference configuration. When building a custom evaluator, you define instructions with placeholders that get replaced with actual trace information before being sent to the judge model. The scope of information available depends on the evaluator’s level: a session-level evaluator can access the full conversation context and available tools, a trace-level evaluator sees previous turns plus the current assistant response, and a tool-level evaluator focuses on specific tool calls within their surrounding context. The AWS console provides the option to load the prompt template of any existing built-in evaluator as a starting point, making it straightforward to create custom variants (Figure 7).
Figure 7: The AgentCore Evaluations console provides the option to load any built-in evaluator’s prompt template as a starting point when creating a custom evaluator.
When building multiple custom evaluators, use the MECE (Mutually Exclusive, Collectively Exhaustive) principle to design your evaluation suite. Each evaluator should have a distinct, non-overlapping scope while collectively covering all quality dimensions you care about. For example, rather than creating two evaluators that both partially assess "response quality," separate them into one that evaluates factual grounding and another that evaluates communication clarity.

When writing evaluator instructions, establish the judge model's role as a performance evaluator to prevent confusion between evaluation and task execution. Use clear, sequential instructions with precise language, and consider including one to three relevant examples with matching input/output pairs that represent your expected standards. For scoring, choose between binary scales (0/1) for pass/fail scenarios or ordinal scales (such as 1–5) for more nuanced assessments; start with binary scoring when uncertain. The service standardizes output to include a reason field followed by a score field, so the judge model always presents its reasoning before assigning a number. Avoid including your own output formatting instructions, as they can confuse the judge model.
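Putting that guidance together, here is a sketch of a custom evaluator instruction for brand voice consistency. The `{{conversation}}` and `{{response}}` placeholder names are hypothetical; load a built-in template in the console to see the placeholders your evaluator level actually exposes:

```python
# Illustrative custom evaluator instruction: role framing, numbered
# criteria, one pass and one fail example, and a binary 0/1 scale.
# Note: no output-format instructions, per the service's guidance.
BRAND_VOICE_EVALUATOR = """\
You are a performance evaluator. Your task is to judge whether the
assistant's response follows our brand voice guidelines. Do not answer
the user's question yourself.

Guidelines:
1. The response uses plain, friendly language without jargon.
2. The response never makes promises about future product features.

Example (pass): "Happy to help! You can reset your password from the
account page." -> score 1
Example (fail): "Per our roadmap, feature X will ship next quarter." -> score 0

Conversation context:
{{conversation}}

Assistant response to evaluate:
{{response}}

Score 1 if the response follows both guidelines, otherwise score 0.
"""
```

Because the service appends its own reason-then-score output schema, the instruction only needs to define the criteria and scale.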
Built-in and custom evaluators both use the LLM-as-a-Judge approach. AgentCore Evaluations also supports a third option: code-based evaluators, where an AWS Lambda function runs your custom evaluation code.
Code-based evaluators are ideal when you have heuristic scoring methods that don't require language understanding to verify. An LLM evaluator can judge whether a response "sounds correct," but it cannot reliably confirm that a specific pay stub figure of $8,333.33 appears verbatim in a response, or that a generated request ID follows the format PTO-2026-NNN. For these deterministic checks, custom code is faster, cheaper, and more reliable. Code-based evaluators are particularly helpful wherever pass/fail can be determined programmatically rather than judged.
A code-based evaluator is configured as an AWS Lambda function with your custom logic. AgentCore passes the agent’s OTel spans to your function as a structured event and expects a result in return. Your function extracts whatever information it needs from the spans and returns a score, a label, and an explanation.
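The following is a minimal sketch of such a Lambda function, checking the PTO-2026-NNN format mentioned above. The event shape and span attribute names (`spans`, `attributes`, `gen_ai.completion`) are illustrative assumptions; verify the exact schema against the AgentCore Evaluations documentation:

```python
import re

# Request IDs must look like PTO-2026-042 (assumed format from the post).
REQUEST_ID_PATTERN = re.compile(r"\bPTO-2026-\d{3}\b")

def lambda_handler(event, context):
    # Pull the agent's final response text out of the OTel spans passed
    # in the event. Attribute names here are hypothetical.
    response_text = ""
    for span in event.get("spans", []):
        attrs = span.get("attributes", {})
        if "gen_ai.completion" in attrs:
            response_text = attrs["gen_ai.completion"]

    passed = bool(REQUEST_ID_PATTERN.search(response_text))
    # Return the score, label, and explanation the service expects.
    return {
        "score": 1 if passed else 0,
        "label": "pass" if passed else "fail",
        "explanation": (
            "Response contains a request ID matching PTO-2026-NNN."
            if passed
            else "No request ID matching PTO-2026-NNN found in the response."
        ),
    }

# Local smoke test with a fabricated event
event = {
    "spans": [
        {"attributes": {"gen_ai.completion": "Your request PTO-2026-042 was filed."}}
    ]
}
result = lambda_handler(event, None)
```

Because the logic is deterministic, the same input always produces the same score, which is exactly the property LLM judges cannot guarantee.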
Once your Lambda function is deployed and the AgentCore service principal is granted permission to invoke it, you register it as an evaluator. After registration, the evaluator ID can be used for on-demand evaluation.
Configuring the service involves three steps: select your agent, choose your evaluators, and set your sampling rules. Before you begin, deploy your agent using AgentCore Runtime and set up observability through OpenTelemetry or OpenInference instrumentation. The AgentCore samples repository on GitHub provides working examples.
Create a new online evaluation configuration through the AgentCore Evaluations console. Here, you specify which evaluators to apply, which data source to monitor, and what sampling parameters to use. For the data source, select either an existing AgentCore Runtime endpoint or a CloudWatch log group for agents not hosted on AgentCore Runtime. Then choose your evaluators and define your sampling rules.
Figure 8: The AgentCore Evaluations console for creating an online evaluation configuration, including data source selection, evaluator assignment, and sampling rules.
You can also create configurations programmatically using the CreateOnlineEvaluationConfig API with a unique configuration name, data source, list of evaluators (up to 10), and IAM service role. The enableOnCreate parameter controls whether evaluation starts immediately or stays paused, and executionStatus determines whether the configuration actively processes traces once enabled. When a configuration is running, any custom evaluators it references become locked and cannot be modified or deleted. If you need to change an evaluator, clone it and create a new version. Online evaluation results are saved to a dedicated CloudWatch log group in JSON format.
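A programmatic call might be assembled as follows. The field names below mirror those mentioned in this post (`enableOnCreate`, a unique name, a data source, up to 10 evaluators, an IAM role), but the exact boto3 client name, method casing, and request shape are assumptions; verify them against the AgentCore Evaluations API reference before use:

```python
# Hypothetical request payload for CreateOnlineEvaluationConfig.
request = {
    "name": "support-agent-online-eval",          # unique configuration name
    "dataSource": {
        # Either an AgentCore Runtime endpoint or a CloudWatch log group
        "agentRuntimeEndpointArn": "arn:aws:bedrock-agentcore:...",
    },
    "evaluators": [                               # up to 10 evaluators
        "Builtin.Helpfulness",
        "Builtin.ToolSelectionAccuracy",
    ],
    "samplingRate": 0.10,                         # evaluate ~10% of traffic
    "serviceRoleArn": "arn:aws:iam::123456789012:role/AgentCoreEvalRole",
    "enableOnCreate": True,                       # start evaluating immediately
}

# With boto3, the call would look roughly like this (client and method
# names are assumptions — check the API reference):
# import boto3
# client = boto3.client("bedrock-agentcore-control")
# client.create_online_evaluation_config(**request)
```

Keeping the payload as a plain dictionary makes it easy to version-control evaluation configurations alongside the agent code they monitor.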
After enabling your configuration, monitor results through the AgentCore Observability dashboard in Amazon CloudWatch. Agent-level views display aggregated evaluation metrics and trends, and you can drill into specific sessions and traces to see individual scores and the reasoning behind each one.
Figure 9: The AgentCore Observability dashboard displays evaluation metrics and trends at the agent level, with drill-down into individual sessions, traces, scores, and judge reasoning.
Drilling into an individual trace reveals the evaluation scores and detailed explanations for that specific interaction, so teams can verify judge reasoning and understand why the agent received a particular rating.
Figure 10: The trace-level view displays evaluation scores and explanations directly on individual traces, showing the judge model’s reasoning for each metric.
For development and testing, you can use on-demand evaluation to analyze specific interactions by selecting the traces or spans that you want to examine, applying your chosen evaluators, and receiving detailed scores with explanations. Results return directly in the API response, limited to 10 evaluations per call, with each result containing the span context, score, and reasoning. If an evaluation partially fails, the response includes both successful and failed results with error codes and messages. On-demand evaluation works well for testing custom evaluators, investigating specific quality issues, and validating fixes before deployment.
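Because results are capped at 10 evaluations per call, larger test runs need to be chunked. A small helper like this (illustrative, not part of the SDK) keeps each request within the limit:

```python
def batch_evaluation_targets(
    target_ids: list[str], max_per_call: int = 10
) -> list[list[str]]:
    """Split span/trace IDs into chunks that respect the 10-evaluations-
    per-call limit of the on-demand evaluation API."""
    return [
        target_ids[i : i + max_per_call]
        for i in range(0, len(target_ids), max_per_call)
    ]

# 23 targets -> three calls of 10, 10, and 3 evaluations
batches = batch_evaluation_targets([f"span-{i}" for i in range(23)])
```

Each batch would then be passed to a separate on-demand evaluation call, with partial failures in any batch surfaced through the error codes in that call's response.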
LLM-as-judge scoring tells you whether responses seem correct and helpful by the standards of a general-purpose language model. Ground truth evaluation takes this further by letting you specify the answer, the tools that should have been called, and the outcomes the session should have achieved. This helps you measure how closely the agent’s actual behavior matches your reference inputs. This is particularly valuable during development, when you have domain knowledge about what the right behavior is and want to test for specific scenarios.
AgentCore Evaluations supports three types of ground truth reference inputs, each consumed by a specific set of evaluators:
| Reference Input | Evaluator | What it measures |
| --- | --- | --- |
| `expected_response` | Builtin.Correctness | Similarity between the agent's response and the known-correct answer |
| `expected_trajectory` | Builtin.TrajectoryExactOrderMatch, Builtin.TrajectoryInOrderMatch, Builtin.TrajectoryAnyOrderMatch | Whether the agent called the right tools in the right sequence |
| `assertions` | Builtin.GoalSuccessRate | Whether the session satisfied a set of natural-language statements about expected outcomes |
These inputs are optional and independent. Evaluators that don't require ground truth, such as Builtin.Helpfulness and Builtin.ResponseRelevance, can be included in the same call as ground-truth evaluators, and each evaluator reads only the fields it needs. You can supply all three reference inputs simultaneously for a comprehensive evaluation, or supply only the subset relevant to a given scenario.
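A combined reference-input payload might look like the following. The three field names come from this post; the surrounding structure is a sketch, and the exact SDK schema may differ:

```python
# Illustrative ground truth for a single PTO-request scenario,
# combining all three reference input types.
reference_inputs = {
    "expected_response": "Your PTO request PTO-2026-042 has been submitted.",
    "expected_trajectory": [          # tools the agent should call, in order
        "check_pto_balance",
        "submit_pto_request",
    ],
    "assertions": [                   # free-form session-level outcome statements
        "The user's PTO request was filed.",
        "The agent confirmed the remaining PTO balance.",
    ],
}

# Each evaluator reads only the field it needs:
#   expected_response   -> Builtin.Correctness
#   expected_trajectory -> Builtin.Trajectory*Match evaluators
#   assertions          -> Builtin.GoalSuccessRate
```

Supplying only `assertions`, for example, lets you check goal completion without committing to an exact expected answer or tool sequence.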
The bedrock-agentcore Python SDK provides two interfaces for ground truth evaluation: EvaluationClient for assessing existing sessions and OnDemandEvaluationRunner for automated dataset evaluation.
EvaluationClient is the right choice when you already have agent sessions recorded in CloudWatch and want to evaluate specific interactions. You provide the session ID, the agent ID, your chosen evaluators, a lookback window for CloudWatch span retrieval, and optional reference inputs. The client fetches the session's spans and submits them for evaluation. This is well suited to development analysis, debugging specific agent failures, and validating known interactions after prompt or model changes.

EvaluationClient works equally well for multi-turn sessions. When you pass a session ID from a multi-turn conversation, the client fetches all spans for that session and evaluates the complete dialogue. Trajectory evaluators verify tool usage across all turns, goal success assertions apply to the whole session, and correctness evaluators score each individual response against its corresponding expected answer.
OnDemandEvaluationRunner is the right choice when you want to evaluate your agent systematically across a curated dataset by invoking the agent for every scenario, collecting CloudWatch spans, and scoring results in a single automated workflow. You define a dataset containing multi-turn scenarios with per-turn and per-scenario ground truth and provide an agent_invoker function that the runner calls for each turn. The runner manages session IDs and handles all coordination between invocation, span collection, and evaluation.

OnDemandEvaluationRunner is well suited to CI/CD pipelines where the same dataset runs against every build, regression testing after prompt or model changes, and batch evaluation across a large corpus of test cases before a release.
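A dataset and invoker for the runner might be sketched as follows. The dataset shape and the `agent_invoker` signature here are illustrative assumptions; the AgentCore samples repository shows the SDK's actual schema:

```python
# Hypothetical multi-turn scenario with per-turn and per-scenario
# ground truth, mirroring the reference input types described earlier.
dataset = [
    {
        "scenario_id": "pto-request-happy-path",
        "turns": [
            {
                "user_message": "How many PTO days do I have left?",
                "expected_response": "You have 12 PTO days remaining.",
            },
            {
                "user_message": "Book next Friday off.",
                "expected_trajectory": ["submit_pto_request"],
            },
        ],
        "assertions": ["The user's PTO request was filed."],
    }
]

def agent_invoker(session_id: str, user_message: str) -> str:
    """Called by the runner once per turn. Replace this stub with a call
    to your deployed agent (e.g., an AgentCore Runtime invocation)."""
    return f"[stubbed response for session {session_id}: {user_message}]"

# The runner iterates scenarios, calls agent_invoker per turn under a
# managed session ID, collects the resulting CloudWatch spans, and
# scores each turn against its ground truth.
reply = agent_invoker("sess-1", "How many PTO days do I have left?")
```

Because the runner owns session management, the same dataset can be replayed against every build without any test-harness code changes.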
The two interfaces share the same evaluators and reference inputs schema, so you can develop and validate ground truth test cases interactively with EvaluationClient against existing production sessions, then promote those same scenarios into an OnDemandEvaluationRunner dataset for systematic regression testing. The hands-on tutorial in the AgentCore samples repository demonstrates both interfaces end-to-end using an example agent across single-turn and multi-turn scenarios with all three types of ground truth reference inputs.
Success criteria for your agent typically combine three dimensions: the quality of responses, the latency at which users receive them, and the cost of inference. AgentCore Evaluations focuses on the quality dimension, while operational metrics like latency and cost are available through AgentCore Observability in CloudWatch. The following best practices are organized around the three evaluation principles described earlier, and reflect patterns that emerge from working with agent evaluation at scale.
Evidence-driven development
Multi-dimensional assessment
Continuous measurement
In this post, we showed how Amazon Bedrock AgentCore Evaluations helps teams move from reactive debugging to systematic quality management for AI agents. As a fully managed service, it handles the evaluation models, inference infrastructure, and data pipelines that teams would otherwise need to build and maintain for each agent. With on-demand evaluation anchoring the development workflow and online evaluation providing continuous production insight, quality becomes a measurable and improvable property throughout the agent lifecycle. The evaluator relationships and diagnostic patterns give teams a framework not just for scoring agents but for understanding where and why quality issues occur and where to focus improvement efforts.
To explore AgentCore Evaluations in detail, watch the public preview launch session from AWS re:Invent 2025 for a walkthrough with live demos. Visit the Amazon Bedrock AgentCore samples repository on GitHub for hands-on tutorials. For technical details on configuration and API usage, see the AgentCore Evaluations documentation. You can also review service limits and pricing.