Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end-users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.
In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.
From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta's Llama and Qwen architectures, ranging from 1.5 to 70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher, essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain closer performance to the original while still offering meaningful efficiency gains.
In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repo, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.
In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.
You can deploy using the graphical interface on the Amazon Bedrock console by following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.
You should have the following prerequisites:
Complete the following steps to prepare the model package:
For more information, you can follow Hugging Face's Downloading models or Download files from the hub instructions.
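As a concrete illustration, a minimal download sketch using the huggingface_hub library might look like the following; the local destination path is a placeholder you would replace with your own.

```python
# A minimal sketch of downloading the DeepSeek-R1-Distill-Llama-8B artifacts
# from the Hugging Face Hub. Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    local_dir="./DeepSeek-R1-Distill-Llama-8B",  # placeholder destination
)
print(f"Model artifacts downloaded to {local_path}")
```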
You typically need the following files:
config.json
tokenizer.json, tokenizer_config.json, and tokenizer.model
Model weights in .safetensors format

Complete the following steps to import the model:
Specify the S3 location of the folder containing your model artifacts as the model data source (for example, s3://<your-bucket>/folder-with-model-artifacts/). Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).
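If you prefer to script the import instead of using the console, a minimal sketch with the AWS SDK for Python (Boto3) might look like the following; the job name, model name, IAM role ARN, and S3 URI are placeholders you would replace with your own values.

```python
import time
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start the import job; the role must have read access to the S3 artifacts.
job = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import",     # placeholder
    importedModelName="deepseek-r1-distill-llama-8b",  # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockImportRole",  # placeholder
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://<your-bucket>/folder-with-model-artifacts/"}
    },
)

# Poll until the job completes (typically 5-20 minutes for the 8B model).
job_arn = job["jobArn"]
while True:
    status = bedrock.get_model_import_job(jobIdentifier=job_arn)["status"]
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)
print(f"Import job finished with status: {status}")
```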
Watch this video demo for a step-by-step guide.
After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:
<|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>
Because we're using an imported model in the playground, we must include the "beginning_of_sentence" and "user/assistant" tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses. If you're following the programmatic approach in the following notebook, this is handled automatically by the model configuration.
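For reference, a minimal invocation sketch using the Bedrock Runtime client might look like the following. The model ARN is a placeholder, and the request and response schemas here assume the Llama-style format used by models of this architecture; verify the exact schema your imported model expects.

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# DeepSeek-distilled models expect the special conversation tags shown above.
prompt = (
    "<|begin▁of▁sentence|><|User|>"
    "Calculate 2 + 2. Please reason step by step."
    "<|Assistant|>"
)

response = runtime.invoke_model(
    # Placeholder ARN for the imported model.
    modelId="arn:aws:bedrock:us-east-1:111122223333:imported-model/abc1234567",
    # Assumes a Llama-style request body; adjust for your model.
    body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.5}),
)

# Assumes a Llama-style response body with a "generation" field.
print(json.loads(response["body"].read())["generation"])
```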
Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The restoration time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
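A short sketch of customizing that retry behavior with the Boto3 Config object; the attempt count shown is an arbitrary example.

```python
import boto3
from botocore.config import Config

# Allow more total attempts so the SDK keeps retrying with exponential
# backoff while the model copy is being restored from its scaled-to-zero state.
config = Config(retries={"total_max_attempts": 10, "mode": "standard"})
runtime = boto3.client("bedrock-runtime", config=config)
```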
Once you are ready to import the model, use this step-by-step video demo to help you get started.
Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import does not charge for the import itself; you are charged for inference based on two factors: the number of active model copies and their duration of activity.
Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The number of Custom Model Units required for hosting depends on the model's architecture, parameter count, and context length, with examples ranging from 2 Custom Model Units for a Llama 3.1 8B 128K model to 8 Custom Model Units for a Llama 3.1 70B 128K model.
Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
Consider the following pricing example: an application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570, and the model storage costs will be $3.90 for the month.
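As a rough illustration of how those numbers combine, here is a back-of-the-envelope estimate in Python using the rates quoted above; the active-minutes figure is a hypothetical usage pattern, and actual charges depend on your invocation volume and the current published pricing.

```python
# Back-of-the-envelope cost estimate for the example above
# (2 Custom Model Units, us-east-1 rates quoted in this post).
price_per_cmu_minute = 0.0785   # USD; 2 CMUs -> $0.1570 per minute
custom_model_units = 2
storage_per_month = 3.90        # USD storage for this model for the month

# Hypothetical usage: the model copy is active 3 hours/day for a 30-day month.
active_minutes = 3 * 60 * 30
inference_cost = price_per_cmu_minute * custom_model_units * active_minutes
total = inference_cost + storage_per_month
print(f"Inference: ${inference_cost:,.2f}, "
      f"Storage: ${storage_per_month:.2f}, Total: ${total:,.2f}")
```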
For more information, see Amazon Bedrock pricing.
DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80% and 90% of the original model's reasoning capabilities, while the 8B version achieves between 59% and 92% of that performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models in specific reasoning tasks.
When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek’s innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are crucial. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI strategies that can adapt as their requirements evolve.
For more information, refer to the Amazon Bedrock User Guide.