image001
This post is co-written with Eliuth Triana, Abhishek Sawarkar, Jiahong Liu, Kshitiz Gupta, JR Morgan and Deepika Padmanabhan from NVIDIA.
At the 2024 NVIDIA GTC conference, we announced support for NVIDIA NIM Inference Microservices in Amazon SageMaker Inference. This integration allows you to deploy industry-leading large language models (LLMs) on SageMaker and optimize their performance and cost. The optimized prebuilt containers enable the deployment of state-of-the-art LLMs in minutes instead of days, facilitating their seamless integration into enterprise-grade AI applications.
NIM is built on technologies like NVIDIA TensorRT, NVIDIA TensorRT-LLM, and vLLM. NIM is engineered to enable straightforward, secure, and performant AI inferencing on NVIDIA GPU-accelerated instances hosted by SageMaker. This allows developers to take advantage of the power of these advanced models using SageMaker APIs and just a few lines of code, accelerating the deployment of cutting-edge AI capabilities within their applications.
NIM, part of the NVIDIA AI Enterprise software platform listed on AWS Marketplace, is a set of inference microservices that bring the power of state-of-the-art LLMs to your applications, providing natural language processing (NLP) and understanding capabilities, whether you’re developing chatbots, summarizing documents, or implementing other NLP-powered applications. You can use pre-built NVIDIA containers to host popular LLMs that are optimized for specific NVIDIA GPUs for quick deployment. Companies like Amgen, A-Alpha Bio, Agilent, and Hippocratic AI are among those using NVIDIA AI on AWS to accelerate computational biology, genomics analysis, and conversational AI.
In this post, we provide a walkthrough of how customers can use generative artificial intelligence (AI) models and LLMs using NVIDIA NIM integration with SageMaker. We demonstrate how this integration works and how you can deploy these state-of-the-art models on SageMaker, optimizing their performance and cost.
You can use the optimized pre-built NIM containers to deploy LLMs and integrate them into your enterprise-grade AI applications built with SageMaker in minutes, rather than days. We also share a sample notebook that you can use to get started, showcasing the simple APIs and few lines of code required to harness the capabilities of these advanced models.
Getting started with NIM is straightforward. Within the NVIDIA API catalog, developers have access to a wide range of NIM optimized AI models that you can use to build and deploy your own AI applications. You can get started with prototyping directly in the catalog using the GUI (as shown in the following screenshot) or interact directly with the API for free.
To deploy NIM on SageMaker, you need to download NIM and subsequently deploy it. You can initiate this process by choosing Run Anywhere with NIM for the model of your choice, as shown in the following screenshot.
You can sign up for the free 90-day evaluation license on the API Catalog by signing up with your organization email address. This will grant you a personal NGC API key for pulling the assets from NGC and running on SageMaker. For pricing details on SageMaker, refer to Amazon SageMaker pricing.
As a prerequisite, set up an Amazon SageMaker Studio environment:
For this series of steps, we use a SageMaker Studio JupyterLab notebook. You also need to attach an Amazon Elastic Block Store (Amazon EBS) volume of at least 300 MB in size, which you can do in the domain settings for SageMaker Studio. In this example, we use an ml.g5.4xlarge instance, powered by a NVIDIA A10G GPU.
We start by opening the example notebook provided on our JupyterLab instance, import the corresponding packages, and set up the SageMaker session, role, and account information:
The NIM container that comes with SageMaker integration built in is available in the Amazon ECR Public Gallery. To deploy it on your own SageMaker account securely, you can pull the Docker container from the public Amazon Elastic Container Registry (Amazon ECR) container maintained by NVIDIA and re-upload it to your own private container:
NIMs can be accessed using the NVIDIA API catalog. You just need to register for an NVIDIA API key from the NGC catalog by choosing Generate Personal Key.
When creating an NGC API key, choose at least NGC Catalog on the Services Included dropdown menu. You can include more services if you plan to reuse this key for other purposes.
For the purposes of this post, we store it in an environment variable:
NGC_API_KEY = YOUR_KEY
This key is used to download pre-optimized model weights when running the NIM.
We now have all the resources prepared to deploy to a SageMaker endpoint. Using your notebook after setting up your Boto3 environment, you first need to make sure you reference the container you pushed to Amazon ECR in an earlier step:
After the model definition is set up correctly, the next step is to define the endpoint configuration for deployment. In this example, we deploy the NIM on one ml.g5.4xlarge instance:
Lastly, create the SageMaker endpoint:
After the endpoint is deployed successfully, you can run requests against the NIM-powered SageMaker endpoint using the REST API to try out different questions and prompts to interact with the generative AI models:
That’s it! You now have an endpoint in service using NIM on SageMaker.
NIM is part of the NVIDIA Enterprise License. NIM comes with a 90-day evaluation license to start with. To use NIMs on SageMaker beyond the 90-day license, connect with NVIDIA for AWS Marketplace private pricing. NIM is also available as a paid offering as part of the NVIDIA AI Enterprise software subscription available on AWS Marketplace
In this post, we showed you how to get started with NIM on SageMaker for pre-built models. Feel free to try it out following the example notebook.
We encourage you to explore NIM to adopt it to benefit your own use cases and applications.
You can find the workflow by scrolling down on this page: https://comfyanonymous.github.io/ComfyUI_examples/flux/ submitted by /u/comfyanonymous…
Machine learning practitioners spend countless hours on repetitive tasks: monitoring model performance, retraining pipelines, data…
This post is divided into four parts; they are: • Why Attention Masking is Needed…
Artificial intelligence (AI) is an umbrella computer science discipline focused on building software systems capable…
With advances in generative AI, there is increasing work towards creating autonomous agents that can…
Amazon Bedrock Guardrails provides configurable safeguards to help build trusted generative AI applications at scale.…