Monitoring and troubleshooting generative AI inference endpoints operating at scale is challenging. When your large language model (LLM) endpoint’s P99 latency spikes, you must determine in minutes whether the root cause is GPU memory pressure, a saturated KV cache, unbalanced traffic across Availability Zones, or an auto scaling policy that hasn’t triggered. The shift from training to serving is reshaping how teams deploy LLMs and other generative AI models in production. Machine learning (ML) platform engineers, MLOps teams, and site reliability engineers (SREs) must keep inference endpoints healthy, responsive, and cost-efficient, often across dozens of models and hundreds of GPU instances.
Amazon SageMaker AI provides fully managed real-time inference hosting for machine learning models. You deploy a model to a SageMaker endpoint backed by one or more compute instances, and SageMaker handles provisioning and scaling. SageMaker supports multiple endpoint architectures. This post focuses on the two most relevant to generative AI workloads with detailed observability:
SageMaker endpoints emit metrics like invocation counts, model latency, and overhead latency to Amazon CloudWatch. These aggregate metrics are useful for understanding overall endpoint health. As teams scale to multi-model deployments on GPU fleets, they need deeper signals. Amazon SageMaker AI now emits over 100 detailed inference metrics covering GPU health, token-level latency, KV cache pressure, traffic distribution across Availability Zones (AZs), inference component (IC) placement, and cold start diagnostics. These metrics flow to a built-in SageMaker Insights dashboard in Amazon CloudWatch, a fully managed observability solution that removes the need for custom Grafana dashboards and Prometheus configuration. The SageMaker Insights dashboard supports both endpoint types and automatically shows IC-specific panels when inference components are detected.
For more details on SageMaker inference, see Deploy models for real-time inference.
In this post, you will learn how to:
SageMaker inference endpoints emit native OpenTelemetry metrics to CloudWatch. The SageMaker Insights dashboard is located in the CloudWatch console under Infrastructure Monitoring → SageMaker Insights. It queries these metrics using PromQL and renders visualizations at the fleet, endpoint, and inference-component level across three tabs: Performance, Capacity, and Reliability.
For background on the OpenTelemetry and PromQL support in CloudWatch, see Introducing OpenTelemetry PromQL support in Amazon CloudWatch.
You must have the following to follow along with this post.
sagemaker:CreateEndpointConfig, sagemaker:UpdateEndpoint, and cloudwatch:GetMetricData.
GPU instances receive per-accelerator utilization metrics in addition to the CPU and memory metrics available on all instance types. For the full setup guide, see Getting started with detailed observability.
For any new endpoint configurations you create, detailed metrics are turned on by default. The EnableDetailedObservability parameter in your endpoint configuration defaults to true. No additional code is required.
To control how often metrics are published, set MetricsPublishFrequencyInSeconds in MetricsConfig. The default is 60 seconds. For workloads that need near real-time monitoring, set a value below 60 seconds.
Within 2 minutes of the endpoint reaching InService, the OpenTelemetry format metrics begin flowing to CloudWatch.
Existing endpoints require an explicit opt-in. Create a new endpoint configuration with the MetricsConfig flag, then update your endpoint. This follows the same pattern as any endpoint configuration change.
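The opt-in for an existing endpoint can be sketched as a small request builder. This is a minimal sketch, not the definitive API shape: the MetricsConfig field names are copied from this post and should be checked against the CreateEndpointConfig API reference, and the configuration, model, and endpoint names are hypothetical placeholders.

```python
from typing import Any


def build_endpoint_config_request(
    config_name: str,
    model_name: str,
    instance_type: str,
    publish_frequency_seconds: int = 60,
) -> dict[str, Any]:
    """Build keyword arguments for a CreateEndpointConfig call.

    The MetricsConfig keys below are taken from the post text and are
    assumptions here; verify them against the API reference before use.
    """
    if publish_frequency_seconds <= 0:
        raise ValueError("publish_frequency_seconds must be positive")
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
        "MetricsConfig": {
            "EnableDetailedObservability": True,
            "MetricsPublishFrequencyInSeconds": publish_frequency_seconds,
        },
    }


request = build_endpoint_config_request(
    config_name="my-endpoint-config-v2",  # hypothetical name
    model_name="my-llm-model",            # hypothetical name
    instance_type="ml.g6.4xlarge",
)

# With boto3 installed and credentials configured, the request could then be
# applied roughly as follows (not executed here):
#   sm = boto3.client("sagemaker")
#   sm.create_endpoint_config(**request)
#   sm.update_endpoint(EndpointName="my-endpoint",
#                      EndpointConfigName=request["EndpointConfigName"])
```

Building the request as plain data first makes it easy to review or unit test the configuration before any call reaches your account.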
The SageMaker console also provides a guided three-step wizard after you choose Enable detailed observability: learn about the metrics, turn on OTel enrichment, and select which endpoints to opt in.
Native OpenTelemetry metrics flow automatically to CloudWatch after enablement. However, existing classic metrics (Invocations, ModelLatency, OverheadLatency) require OTel enrichment to be visible in the SageMaker Insights dashboard and queryable with PromQL.
In the CloudWatch console, choose Settings, then turn on OTel metric enrichment and Resource tags for telemetry. You set this once per account and AWS Region.
You can access the SageMaker Insights dashboard through either the SageMaker console or the CloudWatch console. Within SageMaker, there are three entry points, each pre-filtered to its context:
| # | Entry Point | Filter Applied | Use Case |
| 1 | Endpoints list page → “Open SageMaker Insights” | Fleet-level (all endpoints) | “Give me the big picture” |
| 2 | Endpoint detail page → “View in SageMaker Insights” | Filtered to that endpoint | “Drill into this specific endpoint” |
| 3 | IC tab → per-IC “Metrics” link | Filtered to endpoint + IC | “Debug this inference component” |
Every path deep-links with pre-applied filters, so you won’t land on a blank dashboard searching for your resources.
The Performance tab is where most customers spend their time. It answers questions like “Is everything running well?” and “If not, which component is the problem?” The Performance tab includes several time-series panels that work together to pinpoint latency issues.
Color-coded hexagons visualize every resource in your fleet. Toggle between Instances, IC Copies, and Endpoints views. The hexagon color indicates state:
Hover over any hexagon to see instance type, TTFT, output TPS, concurrent requests, KV cache utilization, and CloudWatch alarm status. Choose Filter by this instance to drill down. Every panel on the page updates to show only that instance’s data.
The table shows every instance with performance metrics side-by-side. Use this table to spot outliers in TTFT, output TPS, and concurrent requests. The TTFT, Output TPS, Concurrent Requests, and KV Cache columns show data emitted by the vLLM and SGLang frameworks only.
The Token streaming panel plots Time to First Token (TTFT) and Inter-Token Latency (ITL) over time with a P50/P99 toggle. TTFT measures how long users wait before seeing the first response character. ITL measures time between consecutive tokens, which directly affects streaming smoothness. You can filter by endpoint, inference component name, or model to isolate which component contributes to latency.
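To make the two definitions concrete, the following sketch derives TTFT and ITL for a single streamed response from client-side timestamps. The function name and the sample timestamps are illustrative assumptions; the dashboard itself reports these values from engine-emitted metrics, not from this calculation.

```python
from statistics import mean


def token_latencies(
    request_start: float, token_times: list[float]
) -> tuple[float, list[float]]:
    """Return (TTFT, inter-token latencies) in seconds for one response.

    request_start is when the request was sent; token_times holds the
    arrival time of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    # ITL is the gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


# Hypothetical timestamps (seconds since the request was sent).
ttft, itl = token_latencies(0.0, [0.5, 0.75, 1.0, 1.5])
mean_itl = mean(itl)
```

In this example the user waits 0.5 seconds for the first token, and the last gap (0.5 seconds) is twice as long as the earlier ones, which is the kind of unevenness that makes streaming feel choppy.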
When you identify a TTFT spike, the Latency breakdown panel helps you attribute it. This panel separates total latency into Model Latency (time the model spends processing) and Overhead Latency (time the platform spends routing and scheduling). An Invoke tab shows the full request path, and a Streaming tab shows time-to-first-chunk specifically. If both Model Latency and Overhead Latency are normal but TTFT is still elevated, the model’s inference engine might be holding requests in its internal queue, for example, waiting for KV cache slots. Check the Engine and request pressure panel to confirm.
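The attribution reasoning above can be written down as a small decision function. This is a sketch under stated assumptions: the baseline values, the 1.5x tolerance, and the returned labels are all hypothetical, and you would tune them to your own workload.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    model_ms: float
    overhead_ms: float
    ttft_ms: float


def attribute_ttft_spike(
    current: LatencySample, baseline: LatencySample, tolerance: float = 1.5
) -> str:
    """Classify where an elevated TTFT most likely originates.

    A metric counts as elevated when it exceeds baseline * tolerance.
    The tolerance is an illustrative assumption, not a recommended value.
    """
    model_high = current.model_ms > baseline.model_ms * tolerance
    overhead_high = current.overhead_ms > baseline.overhead_ms * tolerance
    ttft_high = current.ttft_ms > baseline.ttft_ms * tolerance

    if not ttft_high:
        return "ttft-normal"
    if model_high and overhead_high:
        return "model-and-overhead"
    if model_high:
        return "model"
    if overhead_high:
        return "overhead"
    # Both latencies look normal but TTFT is up: suspect the engine's
    # internal queue, for example requests waiting for KV cache slots.
    return "engine-queue"


baseline = LatencySample(model_ms=200, overhead_ms=20, ttft_ms=300)
current = LatencySample(model_ms=210, overhead_ms=22, ttft_ms=900)
verdict = attribute_ttft_spike(current, baseline)
```

A verdict of "engine-queue" is the cue to open the Engine and request pressure panel.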
The Traffic distribution panel shows per-instance or per-inference-component request flow with Availability Zone filtering. Toggle the AZ dropdown to isolate traffic by zone. If one AZ shows zero traffic while others are loaded, that indicates a routing or placement issue. You can use the instance/IC toggle to switch between “Which machines handle traffic?” and “Which models handle traffic?” views.
Finally, the Token throughput panel measures actual tokens processed per second, broken down by input/output, percentiles, or by instance. This directly measures inference efficiency. For example, if your ml.g6.4xlarge delivers 150 output tokens per second when the model benchmark shows 500, that indicates a resource constraint, configuration issue, or KV cache pressure. The multi-framework legend (SGLang, vLLM, DJL) lets you compare throughput across inference engines on multi-model endpoints.
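The comparison in the example above is a simple ratio. The following sketch computes it; the 0.5 cutoff used to flag an instance for investigation is a hypothetical choice, not a documented threshold.

```python
def throughput_efficiency(observed_tps: float, benchmark_tps: float) -> float:
    """Return observed output tokens per second as a fraction of the benchmark."""
    if benchmark_tps <= 0:
        raise ValueError("benchmark_tps must be positive")
    return observed_tps / benchmark_tps


# Figures from the example in the text: 150 observed vs. 500 benchmarked.
ratio = throughput_efficiency(150, 500)

# Hypothetical cutoff: flag anything running below half of its benchmark.
needs_investigation = ratio < 0.5
```

Here the instance reaches 30 percent of its benchmarked throughput, so it would be flagged.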
The Engine and request pressure panel is your early warning system for preventing outages.
The time-series view shows the per-framework breakdown, with tooltips that show exact values at any timestamp. If KV cache utilization repeatedly climbs to 40–50 percent during business hours, configure auto scaling to trigger at a threshold you choose, so that capacity is added before customers feel the impact.
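One way to separate a repeated climb from a one-off blip is to require several consecutive samples above a threshold before acting. The sketch below illustrates that idea only; the threshold, the run length, and the sample values are hypothetical, and it is not how SageMaker auto scaling evaluates its policies.

```python
def sustained_breach(
    samples: list[float], threshold: float, min_consecutive: int
) -> bool:
    """Return True if at least min_consecutive samples in a row meet the threshold."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise reset it.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical KV cache utilization samples, as fractions of capacity.
busy_hours = [0.20, 0.42, 0.45, 0.48, 0.30]
single_blip = [0.20, 0.42, 0.30, 0.45, 0.10]

busy_alert = sustained_breach(busy_hours, threshold=0.40, min_consecutive=3)
blip_alert = sustained_breach(single_blip, threshold=0.40, min_consecutive=3)
```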
The Capacity tab answers questions like “Do I have enough resources?”, “Where is there headroom?”, and “Can I fit another model?”
The same honeycomb visualization from Performance reappears here, with resource utilization percentages in the hover card: GPU, GPU memory, CPU, CPU memory, and Disk.
Before you deploy a new model or scale copies, hover over instances in your target endpoint. If GPU memory is at 89 percent, there’s limited VRAM headroom for additional model weights.
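The headroom check is simple arithmetic, sketched below. The 24 GiB accelerator size, the 8 GiB model footprint, and the 5 percent safety reserve are hypothetical values chosen for illustration; real deployments also need room for KV cache and activations.

```python
def fits_in_vram(
    total_gib: float,
    used_fraction: float,
    model_gib: float,
    reserve_fraction: float = 0.05,
) -> bool:
    """Return True if model weights fit in the free GPU memory.

    reserve_fraction keeps a slice of total memory unallocated as a
    safety margin; the default is an illustrative assumption.
    """
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError("used_fraction must be between 0 and 1")
    free_gib = total_gib * (1.0 - used_fraction)
    reserve_gib = total_gib * reserve_fraction
    return model_gib <= free_gib - reserve_gib


# At 89 percent utilization on a 24 GiB accelerator, about 2.6 GiB is free.
fits_when_busy = fits_in_vram(total_gib=24, used_fraction=0.89, model_gib=8)
fits_when_half_used = fits_in_vram(total_gib=24, used_fraction=0.50, model_gib=8)
```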
This panel shows resource consumption trends with toggles for Instance, IC copies, and Endpoint aggregation. Key signals include the following:
The Reliability tab answers questions like “If an AZ goes down, will my inference fleet survive?”, “Are scaling events working?”, and “Why are cold starts slow?”
A bar chart shows instance and IC copy counts per AZ. This view shows your high availability posture.
| Distribution | Risk | Action |
| Evenly spread across 3 or more AZs | Low | No action |
| Concentrated in 1-2 AZs | Medium | Rebalance |
| 0 instances in any AZ | High | Single AZ failure takes you offline |
Toggle between Instances and IC Copies. Instances might be balanced, but IC copies could be concentrated on a few machines.
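A quick way to quantify this posture is to compute the share of capacity you would lose if the most heavily loaded AZ failed. The sketch below applies the same calculation to instances and to IC copies; the metric itself, the AZ names, and the counts are illustrative assumptions rather than something the dashboard reports.

```python
def worst_case_az_loss(counts_by_az: dict[str, int]) -> float:
    """Return the fraction of capacity lost if the largest AZ goes down."""
    total = sum(counts_by_az.values())
    if total == 0:
        raise ValueError("no capacity to evaluate")
    return max(counts_by_az.values()) / total


# Hypothetical fleet: instances are balanced, IC copies are not.
instances = {"az-a": 2, "az-b": 2, "az-c": 2}
ic_copies = {"az-a": 5, "az-b": 1, "az-c": 0}

instance_loss = worst_case_az_loss(instances)
ic_copy_loss = worst_case_az_loss(ic_copies)
```

In this example a single AZ failure removes one third of the instances but five sixths of the IC copies, which is exactly the mismatch the toggle helps you find.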
Every IC provisioning event is displayed as a horizontal stacked bar with four phases:
| Phase | Color | What it measures | Optimization |
| Model download | Blue | Pull model weights from Amazon Simple Storage Service (Amazon S3) | Compress artifacts, use Amazon Elastic File System (Amazon EFS) caching |
| GPU load | Purple | Load weights onto GPU | Smaller quantization, pre-warming |
| Container start | Orange | Container initialization | Reduce dependencies |
In the screenshot, gma-ic-vllm took 237.6 seconds, with model download dominating, while gma-rblk-ic-tiny was only 41.4 seconds because it’s a smaller model. This view tells you which phase to optimize for faster scaling response times.
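To decide which phase to optimize first, you can rank the phases by their share of total provisioning time. In the sketch below, the phase names mirror the three rows shown in the table, and the durations are hypothetical values, not the measurements from the screenshot.

```python
def dominant_phase(phase_seconds: dict[str, float]) -> tuple[str, float]:
    """Return the longest phase and its share of total provisioning time."""
    total = sum(phase_seconds.values())
    if total <= 0:
        raise ValueError("total provisioning time must be positive")
    name = max(phase_seconds, key=lambda phase: phase_seconds[phase])
    return name, phase_seconds[name] / total


# Hypothetical cold start breakdown, in seconds.
cold_start = {
    "model_download": 120.0,
    "gpu_load": 40.0,
    "container_start": 40.0,
}

phase, share = dominant_phase(cold_start)
```

With these numbers, model download accounts for 60 percent of the cold start, so artifact size and download path would be the first things to look at.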
The ICE diagnostics view tracks insufficient capacity errors (ICE), which occur when SageMaker can’t provision requested instances. The table shows:
In the preceding screenshot, all 12 ICE events are for p5.48xlarge across all four AZs, indicating complete regional exhaustion for this instance type. You now know to switch to other instance types as a fallback.
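If you export ICE events for further analysis, the same conclusion can be reached programmatically. The following sketch assumes each event is a simple (instance type, AZ) pair; that record shape and the AZ names are hypothetical, and the 12 synthetic events only mirror the counts described above.

```python
from collections import Counter


def ice_counts(events: list[tuple[str, str]]) -> Counter:
    """Count insufficient capacity events per instance type."""
    return Counter(instance_type for instance_type, _az in events)


def exhausted_in_all_azs(
    events: list[tuple[str, str]], instance_type: str, all_azs: list[str]
) -> bool:
    """Return True if the instance type hit ICE in every listed AZ."""
    azs_hit = {az for itype, az in events if itype == instance_type}
    return azs_hit >= set(all_azs)


# Synthetic data: 12 events for one instance type, 3 in each of 4 AZs.
azs = ["az-a", "az-b", "az-c", "az-d"]
events = [("p5.48xlarge", az) for az in azs for _ in range(3)]

counts = ice_counts(events)
needs_fallback = exhausted_in_all_azs(events, "p5.48xlarge", azs)
```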
For teams with existing Grafana or other PromQL-compatible tools, you can query SageMaker Insights metrics directly from your platform without switching to the CloudWatch console. The following walkthrough demonstrates the setup using Grafana. The same steps apply to self-hosted Grafana or other compatible tools, with minor configuration differences.
In the SageMaker console, choose Endpoints, select your endpoint, and then choose Connect to your observability tool. Copy the displayed endpoint URL. It follows the format shown in the SageMaker console.
In Amazon Managed Grafana (Classic CloudWatch 2.4+) or self-hosted Grafana with the Amazon Managed Service for Prometheus plugin (v3.0.0+):
monitoring.cloudwatch:GetMetricData and cloudwatch:ListMetrics permissions.
Download the dashboard template JSON from the same Connect to your observability tool page in the SageMaker console. Import the downloaded JSON template into Grafana (Dashboards → Import), select the Prometheus data source you configured in Step 2, and you get pre-configured Performance, Capacity, and Reliability panels matching the SageMaker Insights layout.
With the data source connected, you can write custom PromQL queries. For example:
# KV cache
vllm:kv_cache_usage_perc{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}
# Active requests
vllm:num_requests_running{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}
# TTFT P99
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds{"aws.sagemaker.endpoint.name"="ep-prsn-ic","aws.sagemaker.inference_component.name"="ic-qwen3-4b"}[5m]))

SageMaker doesn’t charge separately for emitting detailed observability metrics. The metrics are published to Amazon CloudWatch in OpenTelemetry data format, and standard CloudWatch OpenTelemetry ingestion pricing applies. OpenTelemetry metrics ingested into CloudWatch are charged at $0.50 per GB ingested. If you turn on OTel vended metric enrichment (required to view classic CloudWatch metrics like Invocations and ModelLatency in the Insights dashboard), enriched metrics are also charged at $0.50 per GB. For detailed pricing examples and a cost calculator, see the OpenTelemetry Metrics section on the Amazon CloudWatch pricing page.
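As a rough planning aid, the ingestion charge can be estimated from the per-GB rate quoted above. This is a back-of-the-envelope sketch: the monthly volumes are hypothetical, and the pricing page remains the authoritative source for current rates.

```python
PRICE_PER_GB_USD = 0.50  # rate quoted in this post for OTel metric ingestion


def monthly_ingestion_cost(native_gb: float, enriched_gb: float = 0.0) -> float:
    """Estimate the monthly CloudWatch OTel ingestion charge in USD.

    native_gb is the volume of detailed observability metrics;
    enriched_gb is the volume of enriched classic metrics, if enabled.
    """
    if native_gb < 0 or enriched_gb < 0:
        raise ValueError("volumes must be non-negative")
    return (native_gb + enriched_gb) * PRICE_PER_GB_USD


# Hypothetical volumes: 40 GB of native metrics plus 10 GB of enriched metrics.
estimate = monthly_ingestion_cost(native_gb=40, enriched_gb=10)
```

With these volumes the estimate comes to 25 USD per month.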
To avoid ongoing charges, delete test resources in this order:
GPU instances are billed per second while endpoints are InService. Delete promptly after testing.
In this post, you enabled SageMaker detailed metrics on inference endpoints and used the built-in SageMaker Insights dashboard to monitor fleet health, debug latency using token-level metrics, validate high availability, and plan capacity for new deployments.
To get started, see the following resources:
The SageMaker Insights dashboard and detailed observability metrics are the result of close collaboration between the Amazon SageMaker AI and Amazon CloudWatch teams. We thank the engineering, product, and solutions architecture teams whose work made this launch possible.
We also thank the following contributors for their review and inputs on this blog post: