Today, we’re excited to announce the launch of Amazon SageMaker Large Model Inference (LMI) container v15, powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI’s models, DeepSeek-R1, and many more. Amazon SageMaker AI continues to evolve its generative AI inference capabilities to meet the growing demands in performance and model support for foundation models (FMs).
This release introduces significant performance improvements, expands model compatibility with multimodality (that is, the ability to understand and analyze text-to-text, image-to-text, and text-to-image data), and provides built-in integration with vLLM to help you seamlessly deploy and serve large language models (LLMs) with high performance at scale.
LMI v15 brings several enhancements that improve throughput, latency, and usability, including a new asynchronous operating mode and support for the vLLM V1 engine. The V1 engine is now the default; if you need to fall back to the previous V0 engine, you can set VLLM_USE_V1=0. vLLM V1 also comes with a core re-architecture of the serving engine, with simplified scheduling, zero-overhead prefix caching, clean tensor-parallel inference, efficient input preparation, and advanced optimizations with torch.compile and FlashAttention 3. For more information, see the vLLM Blog.
LMI v15 supports an expanding roster of state-of-the-art models, including the latest releases from leading model providers. The container offers ready-to-deploy compatibility for models including, but not limited to, the families listed above.
Each model family can be deployed using the LMI v15 container by specifying the appropriate model ID, for example, meta-llama/Llama-4-Scout-17B-16E, and configuration parameters as environment variables, without requiring custom code or optimization work.
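As a minimal sketch (the variable names are covered in detail in the deployment section later in this post, and the values here are illustrative), the configuration for a Llama 4 Scout deployment is just a small set of environment variables:

```python
# Minimal sketch: an LMI v15 deployment is driven entirely by environment
# variables, so no custom inference code is needed. Values are illustrative.
llama4_scout_env = {
    "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",  # Hugging Face model ID
    "HF_TOKEN": "<your Hugging Face token>",             # required for gated models
    # Optional: fall back to the previous vLLM V0 engine instead of the default V1.
    # "VLLM_USE_V1": "0",
}
```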
Our benchmarks demonstrate the performance advantages of LMI v15’s V1 engine compared to previous versions:
| Model | Batch size | Instance type | LMI v14 throughput (tokens/s, V0 engine) | LMI v15 throughput (tokens/s, V1 engine) | Improvement |
|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 128 | ml.p4d.24xlarge | 1768 | 2198 | 24% |
| meta-llama/Llama-3.1-8B-Instruct | 64 | ml.g6e.2xlarge | 1548 | 2128 | 37% |
| mistralai/Mistral-7B-Instruct-v0.3 | 64 | ml.g6e.2xlarge | 942 | 1988 | 111% |
Figure: DeepSeek-R1 Distill Llama 70B throughput for various levels of concurrency
Figure: Llama 3.1 8B Instruct throughput for various levels of concurrency
Figure: Mistral 7B throughput for various levels of concurrency
The async engine in LMI v15 shows its strength in high-concurrency scenarios, where multiple simultaneous requests benefit from the optimized request handling. These benchmarks highlight that, for the models tested at batch sizes of 64 and 128, the V1 engine in async mode delivers between 24% and 111% higher throughput than LMI v14 with rolling batch. We suggest keeping in mind the following considerations for optimal performance:
LMI v15 supports three API schemas: OpenAI Chat Completions, OpenAI Completions, and TGI.
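As an illustration, here is a sketch of how the request body differs between two of these schemas, an OpenAI Chat Completions-style payload and a TGI-style payload (field names follow the public OpenAI and TGI conventions; the exact parameters supported depend on the model and configuration):

```python
# Sketch: request bodies for two of the supported API schemas.

# OpenAI Chat Completions-style request
chat_completions_payload = {
    "messages": [
        {"role": "user", "content": "What is Amazon SageMaker?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

# TGI-style request
tgi_payload = {
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7},
}
```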
Getting started with LMI v15 is seamless, and you can deploy a model in only a few lines of code. The container is available through Amazon Elastic Container Registry (Amazon ECR), and deployments are managed through SageMaker AI endpoints. To deploy a model, you specify the Hugging Face model ID, the instance type, and your configuration options as environment variables.
For optimal performance, we recommend the following instances:
To deploy with LMI v15, follow these steps:
Configure the deployment through environment variables. Engine and serving settings are passed in the form OPTION_<CONFIG_NAME>. This consistent approach makes it straightforward for users familiar with earlier LMI versions to migrate to v15. The key settings are:
- HF_MODEL_ID sets the model ID from Hugging Face. You can also download the model from Amazon Simple Storage Service (Amazon S3).
- HF_TOKEN sets the token to download the model. This is required for gated models like Llama 4.
- OPTION_MAX_MODEL_LEN sets the maximum model context length.
- OPTION_MAX_ROLLING_BATCH_SIZE sets the batch size for the model.
- OPTION_MODEL_LOADING_TIMEOUT sets the timeout value for SageMaker to load the model and run health checks.
- SERVING_FAIL_FAST=true: we recommend setting this flag because it allows SageMaker to gracefully restart the container when an unrecoverable engine error occurs.
- OPTION_ROLLING_BATCH=disable disables the rolling batch implementation of LMI, which was the default offering in LMI v14. We recommend using async mode instead, because this latest implementation provides better performance.
- OPTION_ASYNC_MODE=true enables async mode.
- OPTION_ENTRYPOINT provides the entrypoint for vLLM’s async integrations.
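Putting these together, a configuration might look like the following sketch; the values are illustrative, and the exact entrypoint module should be taken from the LMI documentation:

```python
# Sketch: environment variables for an LMI v15 deployment of Llama 4 Scout.
# Values are illustrative; tune them to your model and instance type.
lmi_env = {
    "HF_MODEL_ID": "meta-llama/Llama-4-Scout-17B-16E",
    "HF_TOKEN": "<your Hugging Face token>",           # required for gated models
    "OPTION_MAX_MODEL_LEN": "8192",                    # maximum context length
    "OPTION_MAX_ROLLING_BATCH_SIZE": "64",             # batch size
    "OPTION_MODEL_LOADING_TIMEOUT": "1800",            # seconds for model load and health checks
    "SERVING_FAIL_FAST": "true",                       # restart the container on unrecoverable errors
    "OPTION_ROLLING_BATCH": "disable",                 # turn off the LMI v14-style rolling batch
    "OPTION_ASYNC_MODE": "true",                       # enable the async engine
    "OPTION_ENTRYPOINT": "<vLLM async entrypoint>",    # see the LMI docs for the exact module path
}
```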
Next, select the latest container version (0.33.0-lmi15.0.0-cu128) and your AWS Region (us-east-1), and create a model artifact with all the configurations. To review the latest available container version, see Available Deep Learning Containers Images. Then deploy the model to a SageMaker AI endpoint using model.deploy().
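The following is a minimal deployment sketch using the SageMaker Python SDK. The image URI shown follows the Deep Learning Containers naming pattern for us-east-1 and should be verified against Available Deep Learning Containers Images; the instance type, endpoint name, and timeout are placeholders, and lmi_env is the environment dictionary from the previous sketch.

```python
import sagemaker
from sagemaker.model import Model

# Sketch: package the LMI v15 configuration as a SageMaker model and deploy it.
role = sagemaker.get_execution_role()   # or an explicit IAM role ARN
session = sagemaker.Session()

# LMI v15 container image (verify the tag against the available DLC images).
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

model = Model(
    image_uri=image_uri,
    env=lmi_env,                 # environment variables from the previous sketch
    role=role,
    sagemaker_session=session,
)

# Deploy to a real-time endpoint; values here are placeholders.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    endpoint_name="lmi-v15-endpoint",
    container_startup_health_check_timeout=1800,
)
```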
Once the endpoint is in service, you can invoke it with InvokeEndpoint or InvokeEndpointWithResponseStream, choosing either option based on your needs.
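As a sketch, invoking the endpoint with the AWS SDK for Python (Boto3) looks like the following; the endpoint name matches the deployment sketch above, and the payload uses the Chat Completions-style schema shown earlier (exact fields depend on the schema you choose).

```python
import json
import boto3

# Sketch: invoke the LMI v15 endpoint with and without response streaming.
smr = boto3.client("sagemaker-runtime")

payload = {
    "messages": [{"role": "user", "content": "Summarize what Amazon SageMaker does."}],
    "max_tokens": 256,
}

# Non-streaming invocation (InvokeEndpoint)
response = smr.invoke_endpoint(
    EndpointName="lmi-v15-endpoint",      # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))

# Streaming invocation (InvokeEndpointWithResponseStream)
stream = smr.invoke_endpoint_with_response_stream(
    EndpointName="lmi-v15-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
for event in stream["Body"]:
    # Each event carries a chunk of the generated response as bytes.
    print(event["PayloadPart"]["Bytes"].decode("utf-8"), end="")
```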
To run multi-modal inference with Llama 4 Scout, see the notebook for the full code sample to run inference requests with images.
Amazon SageMaker LMI container v15 represents a significant step forward in large model inference capabilities. With the new vLLM V1 engine, async operating mode, expanded model support, and optimized performance, you can deploy cutting-edge LLMs with greater performance and flexibility. The container’s configurable options give you the flexibility to fine-tune deployments for your specific needs, whether optimizing for latency, throughput, or cost.
We encourage you to explore this release for deploying your generative AI models.
Check out the provided example notebooks to start deploying models with LMI v15.