Open foundation models (FMs) have become a cornerstone of generative AI innovation, enabling organizations to build and customize AI applications while maintaining control over their costs and deployment strategies. By providing high-quality, openly available models, the AI community fosters rapid iteration, knowledge sharing, and cost-effective solutions that benefit both developers and end-users. DeepSeek AI, a research company focused on advancing AI technology, has emerged as a significant contributor to this ecosystem. Their DeepSeek-R1 models represent a family of large language models (LLMs) designed to handle a wide range of tasks, from code generation to general reasoning, while maintaining competitive performance and efficiency.
Amazon Bedrock Custom Model Import enables the import and use of your customized models alongside existing FMs through a single serverless, unified API. You can access your imported custom models on-demand and without the need to manage underlying infrastructure. Accelerate your generative AI application development by integrating your supported custom models with native Bedrock tools and features like Knowledge Bases, Guardrails, and Agents.
In this post, we explore how to deploy distilled versions of DeepSeek-R1 with Amazon Bedrock Custom Model Import, making them accessible to organizations looking to use state-of-the-art AI capabilities within the secure and scalable AWS infrastructure at an effective cost.
From the foundation of DeepSeek-R1, DeepSeek AI has created a series of distilled models based on both Meta's Llama and Qwen architectures, ranging from 1.5 to 70 billion parameters. The distillation process involves training smaller, more efficient models to mimic the behavior and reasoning patterns of the larger DeepSeek-R1 model by using it as a teacher, essentially transferring the knowledge and capabilities of the 671 billion parameter model into more compact architectures. The resulting distilled models, such as DeepSeek-R1-Distill-Llama-8B (from base model Llama-3.1-8B) and DeepSeek-R1-Distill-Llama-70B (from base model Llama-3.3-70B-Instruct), offer different trade-offs between performance and resource requirements. Although distilled models might show some reduction in reasoning capabilities compared to the original 671B model, they significantly improve inference speed and reduce computational costs. For instance, smaller distilled models like the 8B version can process requests much faster and consume fewer resources, making them more cost-effective for production deployments, whereas larger distilled versions like the 70B model maintain closer performance to the original while still offering meaningful efficiency gains.
In this post, we demonstrate how to deploy distilled versions of DeepSeek-R1 models using Amazon Bedrock Custom Model Import. We focus on importing the currently supported variants, DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, which offer an optimal balance between performance and resource efficiency. You can import these models from Amazon Simple Storage Service (Amazon S3) or an Amazon SageMaker AI model repo, and deploy them in a fully managed and serverless environment through Amazon Bedrock. The following diagram illustrates the end-to-end flow.
In this workflow, model artifacts stored in Amazon S3 are imported into Amazon Bedrock, which then handles the deployment and scaling of the model automatically. This serverless approach eliminates the need for infrastructure management while providing enterprise-grade security and scalability.
You can deploy using the graphical interface on the Amazon Bedrock console by following the instructions in this post, or alternatively use the following notebook to deploy programmatically with the Amazon Bedrock SDK.
You should have the following prerequisites:
Complete the following steps to prepare the model package:
For more information, you can follow Hugging Face's Downloading models or Download files from the hub instructions.
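As a concrete illustration, a minimal download sketch using the huggingface_hub library might look like the following; the local destination path is a placeholder you would replace with your own.

```python
# A minimal sketch of downloading the DeepSeek-R1-Distill-Llama-8B artifacts
# from the Hugging Face Hub. Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    local_dir="./DeepSeek-R1-Distill-Llama-8B",  # placeholder destination
)
print(f"Model artifacts downloaded to {local_path}")
```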
You typically need the following files:
config.json
tokenizer.json, tokenizer_config.json, and tokenizer.model
Model weights in .safetensors format

Complete the following steps to import the model:
Specify the S3 location of the folder containing your model artifacts as the model data source (for example, s3://<your-bucket>/folder-with-model-artifacts/). Importing the model will take several minutes depending on the model being imported (for example, the Distill-Llama-8B model could take 5–20 minutes to complete).
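If you prefer to script the import instead of using the console, a minimal sketch with the AWS SDK for Python (Boto3) might look like the following; the job name, model name, IAM role ARN, and S3 URI are placeholders you would replace with your own values.

```python
import time
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Start the import job; the role must have read access to the S3 artifacts.
job = bedrock.create_model_import_job(
    jobName="deepseek-r1-distill-llama-8b-import",     # placeholder
    importedModelName="deepseek-r1-distill-llama-8b",  # placeholder
    roleArn="arn:aws:iam::111122223333:role/BedrockImportRole",  # placeholder
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://<your-bucket>/folder-with-model-artifacts/"}
    },
)

# Poll until the job completes (typically 5-20 minutes for the 8B model).
job_arn = job["jobArn"]
while True:
    status = bedrock.get_model_import_job(jobIdentifier=job_arn)["status"]
    if status in ("Completed", "Failed"):
        break
    time.sleep(60)
print(f"Import job finished with status: {status}")
```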
Watch this video demo for a step-by-step guide.
After you import the model, you can test it by using the Amazon Bedrock Playground or directly through the Amazon Bedrock invocation APIs. To use the Playground, complete the following steps:
<|begin▁of▁sentence|><|User|>Given the following financial data: - Company A's revenue grew from $10M to $15M in 2023 - Operating costs increased by 20% - Initial operating costs were $7M Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \boxed{}<|Assistant|>
Because we're using an imported model in the playground, we must include the "beginning_of_sentence" and "user/assistant" tags to properly format the context for DeepSeek models; these tags help the model understand the structure of the conversation and provide more accurate responses. If you're following the programmatic approach in the following notebook, this is handled automatically by the model configuration.
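For reference, a minimal invocation sketch using the Bedrock Runtime client might look like the following. The model ARN is a placeholder, and the request and response schemas here assume the Llama-style format used by models of this architecture; verify the exact schema your imported model expects.

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# DeepSeek-distilled models expect the special conversation tags shown above.
prompt = (
    "<|begin▁of▁sentence|><|User|>"
    "Calculate 2 + 2. Please reason step by step."
    "<|Assistant|>"
)

response = runtime.invoke_model(
    # Placeholder ARN for the imported model.
    modelId="arn:aws:bedrock:us-east-1:111122223333:imported-model/abc1234567",
    # Assumes a Llama-style request body; adjust for your model.
    body=json.dumps({"prompt": prompt, "max_gen_len": 512, "temperature": 0.5}),
)

# Assumes a Llama-style response body with a "generation" field.
print(json.loads(response["body"].read())["generation"])
```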
Note: When you invoke the model for the first time, if you encounter a ModelNotReadyException error, the SDK automatically retries the request with exponential backoff. The restoration time varies depending on the on-demand fleet size and model size. You can customize the retry behavior using the AWS SDK for Python (Boto3) Config object. For more information, see Handling ModelNotReadyException.
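A short sketch of customizing that retry behavior with the Boto3 Config object; the attempt count shown is an arbitrary example.

```python
import boto3
from botocore.config import Config

# Allow more total attempts so the SDK keeps retrying with exponential
# backoff while the model copy is being restored from its scaled-to-zero state.
config = Config(retries={"total_max_attempts": 10, "mode": "standard"})
runtime = boto3.client("bedrock-runtime", config=config)
```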
Once you are ready to import the model, use this step-by-step video demo to help you get started.
Custom Model Import enables you to use your custom model weights within Amazon Bedrock for supported architectures, serving them alongside Amazon Bedrock hosted FMs in a fully managed way through On-Demand mode. Custom Model Import does not charge for the import itself; you are charged for inference based on two factors: the number of active model copies and their duration of activity.
Billing occurs in 5-minute windows, starting from the first successful invocation of each model copy. The pricing per model copy per minute varies based on factors including architecture, context length, Region, and compute unit version, and is tiered by model copy size. The number of Custom Model Units required for hosting depends on the model's architecture, parameter count, and context length, with examples ranging from 2 Custom Model Units for a Llama 3.1 8B 128K model to 8 Custom Model Units for a Llama 3.1 70B 128K model.
Amazon Bedrock automatically manages scaling, maintaining zero to three model copies by default (adjustable through Service Quotas) based on your usage patterns. If there are no invocations for 5 minutes, it scales to zero and scales up when needed, though this may involve cold-start latency of tens of seconds. Additional copies are added if inference volume consistently exceeds single-copy concurrency limits. The maximum throughput and concurrency per copy is determined during import, based on factors such as input/output token mix, hardware type, model size, architecture, and inference optimizations.
Consider the following pricing example: an application developer imports a customized Llama 3.1 type model that is 8B parameters in size with a 128K sequence length in the us-east-1 Region and deletes the model after 1 month. This requires 2 Custom Model Units. So, the price per minute will be $0.1570, and the model storage costs will be $3.90 for the month.
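As a rough illustration of how those numbers combine, here is a back-of-the-envelope estimate in Python using the rates quoted above; the active-minutes figure is a hypothetical usage pattern, and actual charges depend on your invocation volume and the current published pricing.

```python
# Back-of-the-envelope cost estimate for the example above
# (2 Custom Model Units, us-east-1 rates quoted in this post).
price_per_cmu_minute = 0.0785   # USD; 2 CMUs -> $0.1570 per minute
custom_model_units = 2
storage_per_month = 3.90        # USD storage for this model for the month

# Hypothetical usage: the model copy is active 3 hours/day for a 30-day month.
active_minutes = 3 * 60 * 30
inference_cost = price_per_cmu_minute * custom_model_units * active_minutes
total = inference_cost + storage_per_month
print(f"Inference: ${inference_cost:,.2f}, "
      f"Storage: ${storage_per_month:.2f}, Total: ${total:,.2f}")
```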
For more information, see Amazon Bedrock pricing.
DeepSeek has published benchmarks comparing their distilled models against the original DeepSeek-R1 and base Llama models, available in the model repositories. The benchmarks show that, depending on the task, DeepSeek-R1-Distill-Llama-70B maintains between 80% and 90% of the original model's reasoning capabilities, while the 8B version achieves between 59% and 92% of that performance with significantly reduced resource requirements. Both distilled versions demonstrate improvements over their corresponding base Llama models in specific reasoning tasks.
When deploying DeepSeek models in Amazon Bedrock, consider the following aspects:
Amazon Bedrock Custom Model Import empowers organizations to use powerful publicly available models like DeepSeek-R1 distilled versions, among others, while benefiting from enterprise-grade infrastructure. The serverless nature of Amazon Bedrock eliminates the complexity of managing model deployments and operations, allowing teams to focus on building applications rather than infrastructure. With features like auto scaling, pay-per-use pricing, and seamless integration with AWS services, Amazon Bedrock provides a production-ready environment for AI workloads. The combination of DeepSeek’s innovative distillation approach and the Amazon Bedrock managed infrastructure offers an optimal balance of performance, cost, and operational efficiency. Organizations can start with smaller models and scale up as needed, while maintaining full control over their model deployments and benefiting from AWS security and compliance capabilities.
The ability to choose between proprietary and open FMs in Amazon Bedrock gives organizations the flexibility to optimize for their specific needs. Open models enable cost-effective deployment with full control over the model artifacts, making them ideal for scenarios where customization, cost optimization, or model transparency are crucial. This flexibility, combined with the Amazon Bedrock unified API and enterprise-grade infrastructure, allows organizations to build resilient AI strategies that can adapt as their requirements evolve.
For more information, refer to the Amazon Bedrock User Guide.