Categories: FAANG

Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 2

pre optimized

As generative artificial intelligence (AI) inference becomes increasingly critical for businesses, customers are seeking ways to scale their generative AI operations or integrate generative AI models into existing workflows. Model optimization has emerged as a crucial step, allowing organizations to balance cost-effectiveness and responsiveness, improving productivity. However, price-performance requirements vary widely across use cases. For chat applications, minimizing latency is key to offer an interactive experience, whereas real-time applications like recommendations require maximizing throughput. Navigating these trade-offs poses a significant challenge to rapidly adopt generative AI, because you must carefully select and evaluate different optimization techniques.

To overcome these challenges, we are excited to introduce the inference optimization toolkit, a fully managed model optimization feature in Amazon SageMaker. This new feature delivers up to ~2x higher throughput while reducing costs by up to ~50% for generative AI models such as Llama 3, Mistral, and Mixtral models. For example, with a Llama 3-70B model, you can achieve up to ~2400 tokens/sec on a ml.p5.48xlarge instance v/s ~1200 tokens/sec previously without any optimization.

This inference optimization toolkit uses the latest generative AI model optimization techniques such as compilation, quantization, and speculative decoding to help you reduce the time to optimize generative AI models from months to hours, while achieving the best price-performance for your use case. For compilation, the toolkit uses the Neuron Compiler to optimize the model’s computational graph for specific hardware, such as AWS Inferentia, enabling faster runtimes and reduced resource utilization. For quantization, the toolkit utilizes Activation-aware Weight Quantization (AWQ) to efficiently shrink the model size and memory footprint while preserving quality. For speculative decoding, the toolkit employs a faster draft model to predict candidate outputs in parallel, enhancing inference speed for longer text generation tasks. To learn more about each technique, refer to Optimize model inference with Amazon SageMaker. For more details and benchmark results for popular open source models, see Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.

In this post, we demonstrate how to get started with the inference optimization toolkit for supported models in Amazon SageMaker JumpStart and the Amazon SageMaker Python SDK. SageMaker JumpStart is a fully managed model hub that allows you to explore, fine-tune, and deploy popular open source models with just a few clicks. You can use pre-optimized models or create your own custom optimizations. Alternatively, you can accomplish this using the SageMaker Python SDK, as shown in the following notebook. For the full list of supported models, refer to Optimize model inference with Amazon SageMaker.

Using pre-optimized models in SageMaker JumpStart

The inference optimization toolkit provides pre-optimized models that have been optimized for best-in-class cost-performance at scale, without any impact to accuracy. You can choose the configuration based on the latency and throughput requirements of your use case and deploy in a single click.

Taking the Meta-Llama-3-8b model in SageMaker JumpStart as an example, you can choose Deploy from the model page. In the deployment configuration, you can expand the model configuration options, select the number of concurrent users, and deploy the optimized model.

Deploying a pre-optimized model with the SageMaker Python SDK

You can also deploy a pre-optimized generative AI model using the SageMaker Python SDK in just a few lines of code. In the following code, we set up a ModelBuilder class for the SageMaker JumpStart model. ModelBuilder is a class in the SageMaker Python SDK that provides fine-grained control over various deployment aspects, such as instance types, network isolation, and resource allocation. You can use it to create a deployable model instance, converting framework models (like XGBoost or PyTorch) or Inference Specs into SageMaker-compatible models for deployment. Refer to Create a model in Amazon SageMaker with ModelBuilder for more details.

sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens":128, "do_sample":True}
}

sample_output = [
    {
        "generated_text": "Hello, I'm a language model, and I'm here to help you with your English."
    }
]
schema_builder = SchemaBuilder(sample_input, sample_output)

builder = ModelBuilder(
    model="meta-textgeneration-llama-3-8b", # JumpStart model ID
    schema_builder=schema_builder,
    role_arn=role_arn,
)

List the available pre-benchmarked configurations with the following code:

builder.display_benchmark_metrics()

Choose the appropriate instance_type and config_name from the list based on your concurrent users, latency, and throughput requirements. In the preceding table, you can see the latency and throughput across different concurrency levels for the given instance type and config name. If config name is lmi-optimized, that means the configuration is pre-optimized by SageMaker. Then you can call .build() to run the optimization job. When the job is complete, you can deploy to an endpoint and test the model predictions. See the following code:

# set deployment config with pre-configured optimization
bulder.set_deployment_config(
    instance_type="ml.g5.12xlarge", 
    config_name="lmi-optimized"
)

# build the deployable model
model = builder.build()

# deploy the model to a SageMaker endpoint
predictor = model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Using the inference optimization toolkit to create custom optimizations

In addition to creating a pre-optimized model, you can create custom optimizations based on the instance type you choose. The following table provides a full list of available combinations. In the following sections, we explore compilation on AWS Inferentia first, and then try the other optimization techniques for GPU instances.

Instance Types	Optimization Technique	Configurations
AWS Inferentia	Compilation	Neuron Compiler
GPUs	Quantization	AWQ
GPUs	Speculative Decoding	SageMaker provided or Bring Your Own (BYO) draft model

Compilation from SageMaker JumpStart

For compilation, you can select the same Meta-Llama-3-8b model from SageMaker JumpStart and choose Optimize on the model page. In the optimization configuration page, you can choose ml.inf2.8xlarge for your instance type. Then provide an output Amazon Simple Storage Service (Amazon S3) location for the optimized artifacts. For large models like Llama 2 70B, for example, the compilation job can take more than an hour. Therefore, we recommend using the inference optimization toolkit to perform ahead-of-time compilation. That way, you only need to compile one time.

Compilation using the SageMaker Python SDK

For the SageMaker Python SDK, you can configure the compilation by changing the environment variables in the .optimize() function. For more details on compilation_config, refer to LMI NeuronX ahead-of-time compilation of models tutorial.

compiled_model = builder.optimize(
    instance_type="ml.inf2.8xlarge",
    accept_eula=True,
    compilation_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "2",
            "OPTION_N_POSITIONS": "2048",
            "OPTION_DTYPE": "fp16",
            "OPTION_ROLLING_BATCH": "auto",
            "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
            "OPTION_NEURON_OPTIMIZE_LEVEL": "2",
        }
   },
   output_path=f"s3://{output_bucket_name}/compiled/"
)

# deploy the compiled model to a SageMaker endpoint
predictor = compiled_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Quantization and speculative decoding from SageMaker JumpStart

For optimizing models on GPU, ml.g5.12xlarge is the default deployment instance type for Llama-3-8b. You can choose quantization, speculative decoding, or both as optimization options. Quantization uses AWQ to reduce the model’s weights to low-bit (INT4) representations. Finally, you can provide an output S3 URL to store the optimized artifacts.

With speculative decoding, you can improve latency and throughput by either using the SageMaker provided draft model or bringing your own draft model from the public Hugging Face model hub or upload from your own S3 bucket.

After the optimization job is complete, you can deploy the model or run further evaluation jobs on the optimized model. On the SageMaker Studio UI, you can choose to use the default sample datasets or provide your own using an S3 URI. At the time of writing, the evaluate performance option is only available through the Amazon SageMaker Studio UI.

Quantization and speculative decoding using the SageMaker Python SDK

The following is the SageMaker Python SDK code snippet for quantization. You just need to provide the quantization_config attribute in the .optimize() function.

optimized_model = builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    quantization_config={
        "OverrideEnvironment": {
            "OPTION_QUANTIZE": "awq"
        }
    },
    output_path=f"s3://{output_bucket_name}/quantized/"
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

For speculative decoding, you can change to a speculative_decoding_config attribute and configure SageMaker or a custom model. You may need to adjust the GPU utilization based on the sizes of the draft and target models to fit them both on the instance for inference.

optimized_model = builder.optimize(
    instance_type="ml.g5.12xlarge",
    accept_eula=True,
    speculative_decoding_config={
        "ModelProvider": "sagemaker"
    }
    # speculative_decoding_config={
        # "ModelProvider": "custom",
        # use S3 URI or HuggingFace model ID for custom draft model
        # note: using HuggingFace model ID as draft model requires HF_TOKEN in environment variables
        # "ModelSource": "s3://custom-bucket/draft-model", 
    # }
)

# deploy the optimized model to a SageMaker endpoint
predictor = optimized_model.deploy(accept_eula=True)

# use sample input payload to test the deployed endpoint
predictor.predict(sample_input)

Conclusion

Optimizing generative AI models for inference performance is crucial for delivering cost-effective and responsive generative AI solutions. With the launch of the inference optimization toolkit, you can now optimize your generative AI models, using the latest techniques such as speculative decoding, compilation, and quantization to achieve up to ~2x higher throughput and reduce costs by up to ~50%. This helps you achieve the optimal price-performance balance for your specific use cases with just a few clicks in SageMaker JumpStart or a few lines of code using the SageMaker Python SDK. The inference optimization toolkit significantly simplifies the model optimization process, enabling your business to accelerate generative AI adoption and unlock more opportunities to drive better business outcomes.

To learn more, refer to Optimize model inference with Amazon SageMaker and Achieve up to ~2x higher throughput while reducing costs by up to ~50% for generative AI inference on Amazon SageMaker with the new inference optimization toolkit – Part 1.

About the Authors

James Wu is a Senior AI/ML Specialist Solutions Architect
Saurabh Trikande is a Senior Product Manager
Rishabh Ray Chaudhury is a Senior Product Manager
Kumara Swami Borra is a Front End Engineer
Alwin (Qiyun) Zhao is a Senior Software Development Engineer
Qing Lan is a Senior SDE

Hugging Face Offers Developers Inference-as-a-Service Powered by NVIDIA NIM

One of the world’s largest AI communities — comprising 4 million developers on the Hugging Face platform — is gaining easy access to NVIDIA-accelerated inference on some of the most popular AI models. New inference-as-a-service capabilities will enable developers to rapidly deploy leading large language models such as the Llama…

July 30, 2024

In "FAANG"

Scaling high-performance inference cost-effectively

September 11, 2025

In "FAANG"

Amazon SageMaker AI in 2025, a year in review part 1: Flexible Training Plans and improvements to price performance for inference workloads

In 2025, Amazon SageMaker AI saw dramatic improvements to core infrastructure offerings along four dimensions: capacity, price performance, observability, and usability. In this series of posts, we discuss these various improvements and their benefits. In Part 1, we discuss capacity improvements with the launch of Flexible Training Plans. We also…

February 21, 2026

In "FAANG"

AI Generated Robotic Content

Next Reducing Hallucinations with the Ontology in Palantir AIP »

Previous « Enterprises embrace generative AI, but challenges remain

Published by

AI Generated Robotic Content

Tags: ai/mlfaang

2 years ago

Google’s new AI algorithm reduces memory 6x and increases speed 8x

https://arstechnica.com/ai/2026/03/google-says-new-turboquant-compression-can-lower-ai-memory-usage-without-sacrificing-quality/ submitted by /u/pheonis2 [link] [comments]

12 hours ago