Today, we’re announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, removing the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each invocation.
For payloads up to 128,000 bytes, this removes an entire network round-trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads.
In this post, we explain the motivation behind this feature, walk through the customer experience before and after, and show you how to start using inline payloads today.
You can use Amazon SageMaker AI Async Inference to queue inference requests and process them asynchronously. It’s a good fit for workloads with large payloads, variable traffic, or tolerance for seconds-to-minutes latency. It supports automatic scaling to zero, making it cost-efficient for bursty or batch-style workloads.
Until now, the workflow required two steps on every invocation:
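1. Upload the input payload to Amazon S3.
2. Call InvokeEndpointAsync with the object's S3 URI as InputLocation.

The endpoint processes the request asynchronously and writes the output to a configured S3 output location. The client then either polls that location or receives an Amazon Simple Notification Service (Amazon SNS) notification.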
This two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with small input payloads (in KB) who need longer processing times than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.
With today’s launch, InvokeEndpointAsync accepts a new Body parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required.
Key details:
| Aspect | Details |
| --- | --- |
| New parameter | Body, raw bytes, capped at 128,000 bytes. |
| Max inline size | 128,000 bytes (raw payload). |
| Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both. |
| Output behavior | Unchanged. Output is written to the S3 OutputLocation. |
| Endpoint compatibility | Designed to work with existing async endpoints; no model or container changes expected. |
| Error handling | Size and mutual-exclusivity violations return synchronous ValidationError responses. |
| Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV). |
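The size cap and the mutual-exclusivity rule from the table can also be checked on the client before a request is sent, which avoids a wasted API call. The following is a minimal sketch of such a pre-flight check. The function name `validate_async_input` and the use of `ValueError` are illustrative assumptions, not part of the SageMaker API, and the service remains the authority on what it accepts.

```python
from typing import Optional

# Inline size cap stated in this post (raw payload bytes).
MAX_INLINE_BYTES = 128_000


def validate_async_input(body: Optional[bytes] = None,
                         input_location: Optional[str] = None) -> None:
    """Raise ValueError if the inputs break the rules listed in the table.

    Hypothetical client-side helper; the service performs its own validation.
    """
    if body is not None and input_location is not None:
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is None and input_location is None:
        # Assumption: a request needs exactly one input source.
        raise ValueError("Provide either Body or InputLocation")
    if body is not None and len(body) > MAX_INLINE_BYTES:
        raise ValueError(
            f"Inline payload is {len(body)} bytes; limit is {MAX_INLINE_BYTES}"
        )
```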
The change is clearest in code. The two examples that follow perform the same async invocation against the same endpoint. The first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.
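A minimal sketch of the S3-based pattern, assuming `s3_client` and `runtime_client` are boto3 clients created with `boto3.client("s3")` and `boto3.client("sagemaker-runtime")`. The function name, bucket name, and key layout are hypothetical.

```python
import json
import uuid


def invoke_via_s3(s3_client, runtime_client, endpoint_name, input_bucket, payload):
    """Two-step pattern: upload the payload to S3, then pass its URI."""
    key = f"async-input/{uuid.uuid4()}.json"  # hypothetical key layout
    # Step 1: upload the input payload to S3.
    s3_client.put_object(
        Bucket=input_bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    # Step 2: invoke the endpoint with the S3 URI as InputLocation.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{key}",
        ContentType="application/json",
    )
    return response["OutputLocation"]
```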
This approach requires:
- s3:PutObject permission on the caller.

With the inline Body parameter, none of this is needed: no S3 client, no uuid, no input bucket, no IAM grants on the input path, no stale-object cleanup.
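A minimal sketch of the inline invocation, assuming `runtime_client` is a boto3 `sagemaker-runtime` client from an SDK version that includes the `Body` parameter described in this post. The function name is hypothetical.

```python
import json


def invoke_inline(runtime_client, endpoint_name, payload):
    """Single-call pattern: send the payload in the request body."""
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        # Body is the new parameter described in this post. It needs an SDK
        # version that includes it and must not be combined with InputLocation.
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
    return response["OutputLocation"]
```

The output side is unchanged: the returned value is still an S3 URI under the configured output location.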
Sending the payload inline removes a network hop and a dependency from each request. That translates into five concrete benefits:
- No s3:PutObject permission on the input path.

Inline payloads are typically the simpler choice for small payloads, but InputLocation still has its place. Use the following table to decide which path fits a given workload:
| Scenario | Recommended approach |
| --- | --- |
| Payload <= 128,000 bytes (JSON prompts, structured data) | Inline Body. Simpler. Avoids one network round-trip and S3 PUT charges. |
| Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first. |
| Mixed workload with variable payload sizes | Branch on size. Use Body for small, InputLocation for large. |
| Need to retain input data in S3 for audit or replay | InputLocation. Keeps inputs in your bucket. |
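For the mixed-workload row, the size branch can be isolated in a small helper that builds the request arguments. This is a sketch under stated assumptions: `build_async_request` and the `upload_to_s3` callback are hypothetical names, and the callback is expected to store the payload and return its `s3://` URI.

```python
from typing import Callable, Dict

MAX_INLINE_BYTES = 128_000  # inline cap stated in this post


def build_async_request(endpoint_name: str, payload: bytes,
                        upload_to_s3: Callable[[bytes], str]) -> Dict[str, object]:
    """Return keyword arguments for invoke_endpoint_async, branching on size.

    upload_to_s3 is a hypothetical callback that stores the payload and
    returns its s3:// URI; it is only called for payloads over the cap.
    """
    request: Dict[str, object] = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
    }
    if len(payload) <= MAX_INLINE_BYTES:
        request["Body"] = payload  # small payload: send inline
    else:
        request["InputLocation"] = upload_to_s3(payload)  # large: go through S3
    return request
```

The result can be passed straight to the client, for example `runtime_client.invoke_endpoint_async(**build_async_request(...))`. Workloads that must retain inputs for audit or replay would skip the branch and always use InputLocation.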
See the example code notebook for a full walkthrough.
Before you begin, make sure you have:
- A deployed SageMaker AI async inference endpoint (you can check its status with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
- IAM permission for sagemaker:InvokeEndpointAsync.
- An S3 bucket for inference output (for example, my-output-bucket).

Note: Following this guide uses billable AWS resources. SageMaker AI async inference endpoints incur charges for instance hours, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid ongoing charges.
Inline payload support is available today. To use it:
1. Update the AWS SDK for Python: pip install --upgrade boto3.
2. Confirm the installed version: pip show boto3.
3. Replace the S3 upload and InputLocation pattern with a direct Body parameter, as shown in the preceding code example.
4. Call the InvokeEndpointAsync API with the Body parameter.
5. Read the result path from the response's OutputLocation field.
6. Check the OutputLocation to confirm your inference result was written successfully.

No changes are needed to your endpoint configuration, model container, or output S3 setup.
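Confirming that the result was written to the OutputLocation can be done with a simple polling loop. The following is a minimal sketch, assuming `s3_client` is a boto3 `s3` client and `output_location` is the `s3://` URI returned by InvokeEndpointAsync. The function names, timeout, and polling interval are illustrative assumptions.

```python
import time


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"Not an S3 object URI: {uri!r}")
    return bucket, key


def wait_for_output(s3_client, output_location, timeout_s=300.0,
                    interval_s=5.0, sleep=time.sleep):
    """Poll S3 until the result object exists, then return its bytes."""
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet; keep waiting until the deadline.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"No output at {output_location}")
            sleep(interval_s)
```

Two caveats apply. S3 reports a missing key as NoSuchKey only when the caller is allowed to list the bucket; otherwise the call fails with an access-denied error. A failed inference never produces the output object, so the timeout is what ends the loop; an SNS notification avoids polling altogether.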
To avoid ongoing charges, delete the resources used in this walkthrough:
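Cleanup can be scripted. The following is a minimal sketch, assuming `sm_client` is a boto3 `sagemaker` client and that the endpoint, endpoint configuration, and model were created only for this walkthrough. The function name and all resource names are hypothetical, and objects left in the output bucket need to be removed separately.

```python
def delete_async_endpoint(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint first, then its configuration and model."""
    # The endpoint is what incurs instance-hour charges.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```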
Inline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for every request. For the majority of inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.
The feature is designed to be backward-compatible. Existing InputLocation workflows continue unchanged. Both inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of input source.
Get started today by updating your AWS SDK and using the Body parameter on the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.