Deploy Meta Llama 3.1 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

We’re excited to announce the availability of Meta Llama 3.1 8B and 70B inference support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Meta Llama 3.1 multilingual large language models (LLMs) are a collection of pre-trained and instruction tuned generative models. Trainium and Inferentia, enabled by the AWS Neuron software development kit (SDK), offer high performance and lower the cost of deploying Meta Llama 3.1 by up to 50%.

In this post, we demonstrate how to deploy Meta Llama 3.1 on Trainium and Inferentia instances in SageMaker JumpStart.

What is the Meta Llama 3.1 family?

The Meta Llama 3.1 multilingual LLMs are a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support a long context length (128,000) and are optimized for inference with support for grouped query attention (GQA). The Meta Llama 3.1 instruction tuned text-only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source chat models on common industry benchmarks.

At its core, Meta Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. Architecturally, the core LLM for Meta Llama 3 and Meta Llama 3.1 is the same dense architecture.

Meta Llama 3.1 also offers instruct variants, and the instruct model is fine-tuned for tool use. The model has been trained to generate calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, the model also supports zero-shot tool use.

The responsible use guide from Meta can assist you in additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

What is SageMaker JumpStart?

SageMaker JumpStart offers access to a broad selection of publicly available foundation models (FMs). These pre-trained models serve as powerful starting points that can be deeply customized to address specific use cases. You can now use state-of-the-art model architectures, such as language models, computer vision models, and more, without having to build them from scratch.

With SageMaker JumpStart, you can deploy models in a secure environment. The models are provisioned on dedicated SageMaker Inference instances, including Trainium and Inferentia powered instances, and are isolated within your virtual private cloud (VPC). This provides data security and compliance, because the models operate under your own VPC controls, rather than in a shared public environment. After deploying an FM, you can further customize and fine-tune it using the extensive capabilities of SageMaker, including SageMaker Inference for deploying models and container logs for improved observability. With SageMaker, you can streamline the entire model deployment process.

Solution overview

SageMaker JumpStart provides FMs through two primary interfaces: Amazon SageMaker Studio and the SageMaker Python SDK. This provides multiple options to discover and use hundreds of models for your specific use case.

SageMaker Studio is a comprehensive interactive development environment (IDE) that offers a unified, web-based interface for performing all aspects of the machine learning (ML) development lifecycle. From preparing data to building, training, and deploying models, SageMaker Studio provides purpose-built tools to streamline the entire process. In SageMaker Studio, you can access SageMaker JumpStart to discover and explore the extensive catalog of FMs available for deployment to inference capabilities on SageMaker Inference.

In SageMaker Studio, you can access SageMaker JumpStart by choosing JumpStart in the navigation pane or by choosing JumpStart on the Home page.

Alternatively, you can use the SageMaker Python SDK to programmatically access and use JumpStart models. This approach allows for greater flexibility and integration with existing AI and ML workflows and pipelines. By providing multiple access points, SageMaker JumpStart helps you seamlessly incorporate pre-trained models into your AI and ML development efforts, regardless of your preferred interface or workflow.

In the following sections, we demonstrate how to deploy Meta Llama 3.1 on Trainium instances using SageMaker JumpStart in SageMaker Studio for a one-click deployment and the Python SDK.

Prerequisites

To try out this solution using SageMaker JumpStart, you need the following prerequisites:

An AWS account that will contain all your AWS resources.
An AWS Identity and Access Management (IAM) role to access SageMaker. To learn more about how IAM works with SageMaker, refer to Identity and Access Management for Amazon SageMaker.
Access to SageMaker Studio or a SageMaker notebook instance, or an IDE) such as PyCharm or Visual Studio Code. We recommend using SageMaker Studio for straightforward deployment and inference.
One instance of ml.trn1.32xlarge for SageMaker hosting.

Deploy Meta Llama 3.1 using the SageMaker JumpStart UI

From the SageMaker JumpStart landing page, you can browse for models, notebooks, and other resources. You can find Meta Llama 3.1 Neuron models by searching by “3.1” or find them in the Meta hub.

If you don’t see Meta Llama 3.1 Neuron models in SageMaker Studio Classic, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Apps.

In SageMaker JumpStart, you can access the Meta Llama 3.1 Neuron models listed in the following table.

Model Card	Description	Key Capabilities
Meta Llama 3.1 8B Neuron	`Llama-3.1-8B` is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation supported in 10 languages.	Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents.
Meta Llama 3.1 8B Instruct Neuron	`Llama-3.1-8B-Instruct` is an update to `Meta-Llama-3-8B-Instruct`, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities.	Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation.
Meta Llama 3.1 70B Neuron	`Llama-3.1-70B` is a state-of-the-art openly accessible model that excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation in 10 languages.	Multilingual support and stronger reasoning capabilities, enabling advanced use cases like long-form text summarization and multilingual conversational agents.
Meta Llama 3.1 70B Instruct Neuron	`Llama-3.1-70B-Instruct` is an update to `Meta-Llama-3-70B-Instruct`, an assistant-like chat model, that includes an expanded 128,000 context length, multilinguality, and improved reasoning capabilities	Able to follow instructions and tasks, improved reasoning and understanding of nuances and context, and multilingual translation.

You can choose the model card to view details about the model such as license, data used to train, and how to use.

You can also find two buttons on the model details page, Deploy and Preview notebooks, which help you use the model.

When you choose Deploy, a pop-up will show the end-user license agreement and acceptable use policy for you to acknowledge.

When you acknowledge the terms choose Deploy, model deployment will start.

Deploy Meta Llama 3.1 using the Python SDK

Alternatively, you can deploy through the example notebook available from the model page by choosing Preview notebooks. The example notebook provides end-to-end guidance on how to deploy the model for inference and clean up resources.

To deploy using a notebook, we start by selecting an appropriate model, specified by the model_id. For example, you can deploy a Meta Llama 3.1 70B Instruct model through SageMaker JumpStart with the following SageMaker Python SDK code:

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id = "meta-textgenerationneuron-llama-3-1-70b-instruct") 
predictor = model.deploy(accept_eula=True)

This deploys the model on SageMaker with default configurations, including default instance type and default VPC configurations. You can change these configurations by specifying non-default values in JumpStartModel. To successfully deploy the model, you must manually set accept_eula=True as a deploy method argument. After it’s deployed, you can run inference against the deployed endpoint through the SageMaker predictor:

payload = {
    "inputs": "The color of the sky is blue but sometimes it can also be ",
    "parameters": {"max_new_tokens":256, "top_p":0.9, "temperature":0.6}
}
response = predictor.predict(payload)

The following table lists all the Meta Llama models available in SageMaker JumpStart, along with the model_id, default instance type, and supported instance types for each model.

Model Card	Model ID	Default Instance Type	Supported Instance Types
Meta Llama 3 1 8B Neuron	`meta-textgenerationneuron-llama-3-1-8b`	ml.inf2.48xlarge	ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Meta Llama 3.1 8B Instruct Neuron	`meta-textgenerationneuron-llama-3-1-8b-instruct`	ml.inf2.48xlarge	ml.inf2.8xlarge, ml.inf2.24xlarge, ml.inf2.48xlarge, ml.trn1.2xlarge, ml.trn1.32xlarge, ml.trn1n.32xlarge
Meta Llama 3.1 70B Neuron	`meta-textgenerationneuron-llama-3-1-70b`	ml.trn1.32xlarge	ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge
Meta Llama 3.1 70B Instruct Neuron	`meta-textgenerationneuron-llama-3-1-70b-instruct`	ml.trn1.32xlarge	ml.trn1.32xlarge, ml.trn1n.32xlarge, ml.inf2.48xlarge

If you want more control of the deployment configurations, such as context length, tensor parallel degree, and maximum rolling batch size, you can modify them using environmental variables. The underlying Deep Learning Container (DLC) of the deployment is the Large Model Inference (LMI) NeuronX DLC. Refer to the LMI user guide for the supported environment variables.

SageMaker JumpStart has pre-compiled Neuron graphs for a variety of configurations for the preceding parameters to avoid runtime compilation. The configurations of pre-compiled graphs are listed in the following table. As long as the environmental variables fall into one of the following categories, compilation of Neuron graphs will be skipped.

Meta Llama 3.1 8B and Meta Llama 3.1 8B Instruct
OPTION_N_POSITIONS	OPTION_MAX_ROLLING_BATCH_SIZE	OPTION_TENSOR_PARALLEL_DEGREE	OPTION_DTYPE
8192	8	2	bf16
8192	8	4	bf16
8192	8	8	bf16
8192	8	12	bf16
8192	8	24	bf16
8192	8	32	bf16
Meta Llama 3.1 70B and Meta Llama 3.1 70B Instruct
OPTION_N_POSITIONS	OPTION_MAX_ROLLING_BATCH_SIZE	OPTION_TENSOR_PARALLEL_DEGREE	OPTION_DTYPE
8192	8	24	bf16
8192	8	32	bf16

The following is an example of deploying Meta Llama 3.1 70B Instruct and setting all the available configurations:

from sagemaker.jumpstart.model import JumpStartModel 

model_id = "meta-textgenerationneuron-llama-3-1-70b-instruct"

model = JumpStartModel( 
    model_id=model_id, 
    env={
        "OPTION_DTYPE": "bf16",
        "OPTION_N_POSITIONS": "8192",
        "OPTION_TENSOR_PARALLEL_DEGREE": "24",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
     }, 
     instance_type="ml.trn1.32xlarge"
) 

pretrained_predictor = model.deploy(accept_eula=True)

Now that you have deployed the Meta Llama 3.1 70B Instruct model, you can run inference with it by invoking the endpoint. The following code snippet demonstrates using the supported inference parameters to control text generation:

{
    'inputs': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>nnAlways answer with Haiku<|eot_id|><|start_header_id|>user<|end_header_id|>nnI am going to Paris, what should I see?<|eot_id|><|start_header_id|>assista<|end_header_id|>nn',
    'parameters': {
        'max_new_tokens': 256,
        'top_p': 0.9,
        'temperature': 0.6,
        'stop': '<|eot_id|>'
    }
}

response = pretrained_predictor.predict(payload)

We get the following output:

{'generated_text': "Eiffel's iron lacenRiver Seine's gentle flow bynMontmartre's charm calls<|eot_id|>"}

For more information on the parameters in the payload, refer to Parameters.

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

pretrained_predictor.delete_predictor()

Conclusion

The deployment of Meta Llama 3.1 Neuron models on SageMaker demonstrates a significant advancement in managing and optimizing large-scale generative AI models with reduced costs up to 50% compared to GPU. These models, including variants like Meta Llama 3.1 8B and 70B, use Neuron for efficient inference on Inferentia and Trainium based instances, enhancing their performance and scalability.

The ability to deploy these models through the SageMaker JumpStart UI and Python SDK offers flexibility and ease of use. The Neuron SDK, with its support for popular ML frameworks and high-performance capabilities, enables efficient handling of these large models.

For more information on deploying and fine-tuning pre-trained Meta Llama 3.1 models on GPU-based instances, refer to Llama 3.1 models are now available in Amazon SageMaker JumpStart and Fine-tune Meta Llama 3.1 models for generative AI inference using Amazon SageMaker JumpStart.

About the authors

Sharon Yu is a Software Development Engineer with Amazon SageMaker based in New York City.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.