Deploying large language models (LLMs) for inference requires reliable GPU capacity, especially during critical evaluation periods, limited-duration production testing, or burst workloads. Capacity constraints can delay deployments and impact application performance. Customers can use Amazon SageMaker AI training plans to reserve compute capacity for specified time periods. Originally designed for training workloads, training plans now support inference endpoints, providing predictable GPU availability for time-bound inference workloads.
Consider a common scenario: a data science team must evaluate several fine-tuned language models over a two-week period before selecting one for production. They require uninterrupted access to ml.p5.48xlarge instances to run comparative benchmarks, but on-demand capacity in their AWS Region is unpredictable during peak hours. By reserving capacity through training plans, they can run evaluations uninterrupted with controlled costs and predictable availability.
Amazon SageMaker AI training plans offer a flexible way to secure capacity: you can search for available offerings and select the instance type, quantity, and duration that match your needs. You can choose a reservation window that starts days or months in the future, and specify how many consecutive days it runs. After the plan is created, it provides set capacity that you can reference when deploying SageMaker AI inference endpoints.
In this post, we walk through how to search for available p-family GPU capacity, create a training plan reservation for inference, and deploy a SageMaker AI inference endpoint on that reserved capacity. We follow a data scientist’s journey as they reserve capacity for model evaluation and manage the endpoint throughout the reservation lifecycle.
SageMaker AI training plans provide a mechanism to reserve compute capacity for specific time windows. When creating a training plan, customers specify their target resource type. By setting the value of the target resource to “endpoint”, you can secure p-family GPU instances specifically for inference workloads. The reserved capacity is referenced through an Amazon Resource Name (ARN) in the endpoint configuration so that the endpoint deploys the reserved instances.
The training plan creation and utilization workflow consists of four key phases:

1. Search for available capacity offerings that match your instance type, count, and time window.
2. Create the training plan reservation to secure the capacity.
3. Create an endpoint configuration that references the reservation and deploy the endpoint.
4. Manage the endpoint through the reservation lifecycle, including updates, scaling, and cleanup.
Let’s walk through each phase with detailed examples.
Before starting, ensure that you have the following:
Our data scientist begins by identifying available p-family GPU capacity that matches their evaluation requirements. They need one ml.p5.48xlarge instance for a week-long evaluation starting in late January. Using the search-training-plan-offerings API, they specify the instance type, instance count, duration, and time window. Setting target resources to “endpoint” configures the capacity to be provisioned specifically for inference rather than training jobs.
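The search call can be sketched as follows. This is a minimal Python sketch against the `search_training_plan_offerings` API; the parameter names follow the boto3 SageMaker client, but the `endpoint` target resource value is taken from this post, so verify both against the API reference for your SDK version. The dates shown are illustrative.

```python
from datetime import datetime, timezone

def build_search_request(instance_type, instance_count, duration_hours,
                         start_after, end_before):
    """Build the search-training-plan-offerings request payload.

    Setting TargetResources to ["endpoint"] asks for capacity provisioned
    for inference rather than training jobs (value taken from this post).
    """
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "DurationHours": duration_hours,
        "StartTimeAfter": start_after,
        "EndTimeBefore": end_before,
        "TargetResources": ["endpoint"],
    }

request = build_search_request(
    instance_type="ml.p5.48xlarge",
    instance_count=1,
    duration_hours=168,  # 7 days
    start_after=datetime(2025, 1, 20, tzinfo=timezone.utc),
    end_before=datetime(2025, 2, 10, tzinfo=timezone.utc),
)
# sm = boto3.client("sagemaker")
# offerings = sm.search_training_plan_offerings(**request)["TrainingPlanOfferings"]
```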
Example output:
The response provides detailed information about each available capacity block, including the instance type, quantity, duration, Availability Zone, and pricing. Each offering includes specific start and end times, so you can select a reservation that aligns with your deployment schedule. In this case, the team finds a 168-hour (7-day) reservation in us-west-2a that fits their timeline.
After identifying a suitable offering, the team creates the training plan reservation to secure the capacity:
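A sketch of the purchase step, assuming the boto3 `create_training_plan` call that takes an offering ID from the search response (the plan name and offering ID below are hypothetical):

```python
def build_create_plan_request(plan_name, offering_id):
    """Build the CreateTrainingPlan request. The offering ID comes from the
    search response; purchasing commits you to the full upfront cost of
    the reservation window."""
    return {
        "TrainingPlanName": plan_name,
        "TrainingPlanOfferingId": offering_id,
    }

req = build_create_plan_request("llm-eval-plan", "tpo-EXAMPLE")  # hypothetical IDs
# sm = boto3.client("sagemaker")
# resp = sm.create_training_plan(**req)
# training_plan_arn = resp["TrainingPlanArn"]  # save this for the endpoint config
```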
Example output:
The TrainingPlanArn uniquely identifies the reserved capacity. Save this ARN; it's the key that links the endpoint to the reserved p-family GPU capacity. With the reservation confirmed and paid for, the team is now ready to configure their inference endpoint.
You can also create training plans through the SageMaker AI console. This provides a visual interface for searching capacity and completing the reservation. The console workflow follows three steps: search for offerings, add plan details, and review and purchase.
Navigating to Training Plans:
The following screenshot shows the Training Plans landing page where you initiate the creation workflow.
Figure 1: Training Plans landing page with Create training plan button
Step A – Search for training plan offerings:
Specify the Instance type (for example, ml.p5.48xlarge) and Instance count. The following screenshot shows the search interface with Inference Endpoint selected and the criteria filled in:
Figure 2: Step A – Search training plan offerings with Inference Endpoint target
After selecting Find training plan, the Available plans section displays matching offerings:
Figure 3: Available training plan offerings with pricing and availability details
Complete the reservation:
After the reservation is created, you receive a training plan ARN. With the reservation confirmed and paid for, you're now ready to configure your inference endpoint using this ARN. The endpoint will only function during the reservation window specified in the training plan.
With the reservation secured, the team creates an endpoint configuration that binds their inference endpoint to the reserved capacity. The critical step here is including the CapacityReservationConfig object in the ProductionVariants section where they set the MlReservationArn to the training plan ARN received earlier:
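A sketch of the endpoint configuration, using the `CapacityReservationConfig`, `MlReservationArn`, and `CapacityReservationPreference` fields described in this post (the config, model, and account values below are hypothetical; confirm the exact field shapes against the CreateEndpointConfig API reference for your SDK version):

```python
def build_endpoint_config(config_name, model_name, plan_arn, instance_count=1):
    """Build an endpoint config bound to reserved training plan capacity."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": "ml.p5.48xlarge",
                "InitialInstanceCount": instance_count,
                "CapacityReservationConfig": {
                    # Restrict the variant to reserved capacity only, so it
                    # stops serving when the reservation ends.
                    "CapacityReservationPreference": "capacity-reservations-only",
                    "MlReservationArn": plan_arn,
                },
            }
        ],
    }

cfg = build_endpoint_config(
    "llm-eval-config",   # hypothetical names
    "llm-eval-model",
    "arn:aws:sagemaker:us-west-2:111122223333:training-plan/llm-eval-plan",
)
# sm.create_endpoint_config(**cfg)
```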
When SageMaker AI receives this request, it validates that the ARN points to an active training plan reservation with a target resource type of “endpoint”. If validation succeeds, the endpoint configuration is created and becomes eligible for deployment. The CapacityReservationPreference setting is particularly important. By setting it to capacity-reservations-only, the team restricts the endpoint to their reserved capacity, so it stops serving traffic when the reservation ends, preventing unexpected charges.
With the endpoint configuration ready, the team deploys their evaluation endpoint:
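A minimal sketch of the deployment step, assuming the standard `create_endpoint` call and the boto3 `endpoint_in_service` waiter (the endpoint and config names are hypothetical):

```python
def build_create_endpoint_request(endpoint_name, config_name):
    # CreateEndpoint links the endpoint name to the configuration that
    # carries the capacity reservation; provisioning can take several minutes.
    return {"EndpointName": endpoint_name, "EndpointConfigName": config_name}

req = build_create_endpoint_request("llm-eval-endpoint", "llm-eval-config")
# sm.create_endpoint(**req)
# sm.get_waiter("endpoint_in_service").wait(EndpointName="llm-eval-endpoint")
```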
The endpoint now runs entirely within the reserved training plan capacity. SageMaker AI provisions the ml.p5.48xlarge instance in us-west-2a and loads the model; this process can take several minutes. After the endpoint reaches InService status, the team can begin their evaluation workload.
With the endpoint in service, the team can begin running their evaluation workload. They invoke the endpoint for real-time inference, sending test prompts and measuring response quality, latency, and throughput:
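An invocation sketch using the `sagemaker-runtime` client's `invoke_endpoint` call. The JSON payload shape (`inputs`/`parameters`) is a common convention for LLM containers, not a fixed requirement, so adjust it to your model server; the endpoint name is hypothetical.

```python
import json

def build_invoke_args(endpoint_name, prompt, max_new_tokens=256):
    """Build invoke_endpoint arguments; the body schema depends on the
    model container deployed behind the endpoint."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(
            {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
        ),
    }

args = build_invoke_args("llm-eval-endpoint", "Summarize the quarterly report.")
# smr = boto3.client("sagemaker-runtime")
# resp = smr.invoke_endpoint(**args)
# print(resp["Body"].read().decode())
```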
During the active reservation window, the endpoint operates normally with a set capacity. All invocations are processed using the reserved resources, helping to facilitate predictable performance and availability. The team can run their benchmarks without worrying about capacity constraints or performance variability from shared infrastructure.
It’s worth understanding what happens if the training plan reservation expires while the endpoint is still deployed.
When the reservation expires, endpoint behavior depends on the CapacityReservationPreference setting. Because the team set it to capacity-reservations-only, the endpoint stops serving traffic and invocations fail with a capacity error:
Expected error response:
To resume service, you must either create a new training plan reservation and update the endpoint configuration, or update the endpoint to use on-demand or On-Demand Capacity Reservation (ODCR) capacity. In the team's case, because they completed their evaluation, they delete the endpoint rather than extending the reservation.
During the evaluation period, you might need to update the endpoint for various reasons. SageMaker AI supports several update scenarios while maintaining the connection to reserved capacity.
Midway through the evaluation, the team wants to test a new model version that incorporates additional fine-tuning. They can update to the new model version while keeping the same reserved capacity:
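One way to sketch this update: clone the existing endpoint configuration with the new model while keeping the same `CapacityReservationConfig`, register it, and point the endpoint at it with `update_endpoint`. The helper and names below are illustrative, not the only way to structure this.

```python
import copy

def config_with_new_model(base_config, new_config_name, new_model_name):
    """Clone an endpoint config dict, swapping in the new model while
    keeping the CapacityReservationConfig, so the updated endpoint stays
    on the same reserved capacity."""
    cfg = copy.deepcopy(base_config)
    cfg["EndpointConfigName"] = new_config_name
    cfg["ProductionVariants"][0]["ModelName"] = new_model_name
    return cfg

# new_cfg = config_with_new_model(cfg, "llm-eval-config-v2", "llm-eval-model-v2")
# sm.create_endpoint_config(**new_cfg)
# sm.update_endpoint(EndpointName="llm-eval-endpoint",
#                    EndpointConfigName="llm-eval-config-v2")
```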
If the team’s evaluation runs longer than expected or if they want to transition the endpoint to production use beyond the reservation period, they can migrate to on-demand capacity:
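A sketch of that migration, under the assumption that omitting `CapacityReservationConfig` from a new endpoint configuration lets the variant fall back to on-demand capacity; verify this against the CreateEndpointConfig API reference before relying on it. Names are hypothetical.

```python
import copy

def config_on_demand(base_config, new_config_name):
    """Clone an endpoint config and drop the capacity reservation binding.

    Assumption: leaving out CapacityReservationConfig opts the variant
    back into on-demand capacity; confirm with the API reference.
    """
    cfg = copy.deepcopy(base_config)
    cfg["EndpointConfigName"] = new_config_name
    cfg["ProductionVariants"][0].pop("CapacityReservationConfig", None)
    return cfg

# od_cfg = config_on_demand(cfg, "llm-eval-config-ondemand")
# sm.create_endpoint_config(**od_cfg)
# sm.update_endpoint(EndpointName="llm-eval-endpoint",
#                    EndpointConfigName="llm-eval-config-ondemand")
```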
In some scenarios, teams can reserve more capacity than they initially deploy, giving them flexibility to scale up if needed. For example, if the team reserved two instances but initially deployed only one, they can scale up during the evaluation period to test higher throughput scenarios.
Suppose the team initially reserved two ml.p5.48xlarge instances but deployed their endpoint with only one instance. Later, they want to test how the model performs under higher concurrent load:
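The scale-up can be sketched with the `update_endpoint_weights_and_capacities` call, which changes the instance count in place without a new endpoint configuration (endpoint and variant names are hypothetical):

```python
def build_scale_request(endpoint_name, variant_name, instance_count):
    """Build an update_endpoint_weights_and_capacities request. The
    requested count must stay within the reservation's instance limit."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": variant_name, "DesiredInstanceCount": instance_count}
        ],
    }

req = build_scale_request("llm-eval-endpoint", "AllTraffic", 2)
# sm.update_endpoint_weights_and_capacities(**req)
```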
If you attempt to scale beyond the reserved capacity, the update fails:
Expected error:
After completing their week-long evaluation, the team has gathered all the performance metrics that they need and selected their top-performing model. They’re ready to clean up the inference endpoint. The training plan reservation automatically expires at the end of the reservation window. You are charged for the full reservation period regardless of when you delete the endpoint.
Important considerations:
It’s important to note that deleting an endpoint doesn’t refund or cancel the training plan reservation. The reserved capacity remains allocated until the training plan reservation window expires, regardless of whether the endpoint is still running. However, if the reservation is still active and capacity is available, you can create a new endpoint using the same training plan reservation ARN. To fully clean up, delete the endpoint configuration:
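The cleanup can be sketched as two deletes; the client is passed in so the helper is easy to exercise. Remember that neither call cancels or refunds the training plan reservation. Names are hypothetical.

```python
def cleanup(sm, endpoint_name, config_name):
    """Delete the endpoint and then its configuration. This frees the
    reserved instances for reuse within the reservation window, but does
    not cancel or refund the training plan itself."""
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=config_name)

# cleanup(boto3.client("sagemaker"), "llm-eval-endpoint", "llm-eval-config")
```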
When setting up your training plan reservation, keep in mind that you're committing to a fixed window of time and will be charged for the full duration upfront, regardless of how long you actually use it. Before purchasing, make sure that your estimated timeline aligns with the reservation length that you choose; even if your evaluation finishes early, the cost doesn't change.
For example, if you purchase a 7-day reservation, you will pay for all seven days even if you complete your work in five. The upside is that this predictable, upfront cost structure helps you to budget accurately for your project. You will know exactly what you’re spending before you start.
SageMaker AI training plans provide a straightforward way to reserve p-family GPU capacity and deploy SageMaker AI inference endpoints with set availability. This approach is recommended for time-bound workloads such as model evaluation, limited-duration production testing, and burst scenarios where predictable capacity is essential.
As we saw in our data science team's journey, the process involves identifying capacity requirements, searching for available offerings, creating a reservation, and referencing that reservation in the endpoint configuration to deploy the endpoint during the reservation window. The team completed their week-long model evaluation with set capacity, avoiding the unpredictability of on-demand availability during peak hours. They could focus on their evaluation metrics rather than worrying about infrastructure constraints.
With support for endpoint updates, scaling within reservation limits, and seamless migration to on-demand capacity, training plans give you the flexibility to manage inference workloads while maintaining control over GPU availability and costs. Whether you’re running competitive model benchmarks, performing limited-duration A/B tests, or handling predictable traffic spikes, training plans for inference endpoints provide the capacity that you need with transparent, upfront pricing.
Special thanks to Alwin (Qiyun) Zhao, Piyush Kandpal, Jeff Poegel, Qiushi Wuye, Jatin Kulkarni, Shambhavi Sudarsan, and Karan Jain for their contributions.