The pace of innovation in open-source AI is breathtaking, with models like Meta’s Llama 4 and DeepSeek AI’s DeepSeek-V3 and R1. However, deploying and optimizing large, powerful models can be complex and resource-intensive. Developers and machine learning (ML) engineers need reproducible, verified recipes that walk through the steps for running these models on available accelerators.
Today, we’re excited to announce enhanced support and new, optimized recipes for the latest Llama 4 and DeepSeek models, leveraging our cutting-edge AI Hypercomputer platform. AI Hypercomputer helps you build a strong AI infrastructure foundation using a set of purpose-built components designed to work well together for AI workloads like training and inference. It is a systems-level approach that draws on our years of experience serving AI to billions of users, combining purpose-built hardware, optimized software and frameworks, and flexible consumption models. Our AI Hypercomputer resources repository on GitHub, your hub for these recipes, continues to grow.
In this blog, we’ll show you how to access Llama 4 and DeepSeek models today on AI Hypercomputer.
Meta recently released the Scout and Maverick models in the Llama 4 herd. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts, and Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts. Both are built on a mixture-of-experts (MoE) architecture and support multimodal input and long context lengths.
But serving these models presents its own deployment and resource-management challenges. To simplify this process, we’re releasing new recipes for serving Llama 4 models on Google Cloud Trillium TPUs and on A3 Mega and A3 Ultra GPUs.
JetStream, Google’s throughput- and memory-optimized engine for LLM inference on XLA devices, now supports Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E inference on Trillium, the sixth-generation TPU. New recipes provide the steps to deploy these models using JetStream and MaxText on a Trillium TPU GKE cluster. vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs, and new recipes demonstrate how to use it to serve the Llama 4 Scout and Maverick models on A3 Mega and A3 Ultra GPU GKE clusters.
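To make the GPU path concrete, here is a minimal sketch of serving Llama 4 Scout with vLLM’s offline Python API. The model ID, parallelism, and context length below are illustrative assumptions; the published recipes pin the exact versions and flags.

```python
# A minimal sketch of serving Llama 4 Scout with vLLM's offline Python API on a
# single GPU VM. The model ID, parallelism, and context length are illustrative
# assumptions; the published recipes pin the exact versions and flags.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed Hugging Face model ID
    tensor_parallel_size=8,  # shard the MoE weights across the 8 GPUs on an A3 VM
    max_model_len=8192,      # conservative context length for a first run
)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
for out in outputs:
    print(out.outputs[0].text)
```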
For serving the Maverick model on TPUs, we use Pathways on Google Cloud. Pathways is a system that simplifies large-scale machine learning computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices. For inference, Pathways enables multi-host serving across multiple TPU slices. Pathways is used internally at Google to train and serve large models like Gemini.
MaxText provides high-performance, highly scalable, open-source reference implementations for LLMs, written in pure Python/JAX and targeting Google Cloud TPUs and GPUs for both training and inference. MaxText now includes reference implementations for the Llama 4 Scout and Maverick models, along with instructions for checkpoint conversion, training, and decoding with Llama 4.
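As a rough illustration of how MaxText is driven, the sketch below runs decoding the way MaxText is normally invoked: a base config file plus key=value overrides. The model name, tokenizer, and checkpoint paths are placeholder assumptions, not values from the recipes.

```python
# A rough sketch of MaxText decoding against a converted Llama 4 checkpoint,
# using MaxText's usual invocation style (config file + key=value overrides).
# Model name and paths below are placeholder assumptions.
import subprocess

subprocess.run([
    "python3", "MaxText/decode.py", "MaxText/configs/base.yml",
    "model_name=llama4-17b-16e",                                # assumed identifier
    "tokenizer_path=assets/tokenizer_llama4",                   # placeholder path
    "load_parameters_path=gs://YOUR_BUCKET/llama4/ckpt/items",  # placeholder path
    "per_device_batch_size=1",
    "max_prefill_predict_length=128",
    "max_target_length=256",
    "prompt=I love to",  # MaxText reads the decode prompt from its config
], check=True)
```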
Earlier this year, DeepSeek released two open-source models: DeepSeek-V3, followed by DeepSeek-R1. The V3 model introduces innovations and optimizations built on an MoE architecture. The R1 model adds reasoning capabilities through a chain-of-thought process.
To help simplify deployment and resource management, we’re releasing new recipes for serving DeepSeek models on Google Cloud Trillium TPUs and A3 Mega and A3 Ultra GPUs.
JetStream now supports DeepSeek-R1-Distill-Llama-70B inference on Trillium. A new recipe provides the steps to deploy it using JetStream and MaxText on a Trillium TPU VM. With vLLM’s recently added TPU support, vLLM users can get the price-performance benefits of TPUs with just a few configuration changes; vLLM on TPU now supports all of the DeepSeek R1 distilled models on Trillium, and a recipe demonstrates how to serve the DeepSeek distilled Llama model on Trillium TPUs.
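Because vLLM exposes an OpenAI-compatible API regardless of the accelerator behind it, querying a TPU-backed server looks the same as querying a GPU-backed one. Here is a minimal sketch, assuming a server is already running locally on the default port and serving the distilled Llama model.

```python
# A minimal sketch of querying a running vLLM server through its
# OpenAI-compatible endpoint. Host, port, and model ID are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed host/port of the vLLM server
    api_key="unused",                     # vLLM requires no real key by default
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B",  # assumed served model ID
    messages=[{"role": "user", "content": "Summarize chain-of-thought prompting."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```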
You can also deploy DeepSeek models using the SGLang inference stack on our A3 Ultra VMs, powered by eight NVIDIA H200 GPUs, with this recipe. A recipe for A3 Mega VMs with SGLang is also available, showing how to deploy multi-host inference across two A3 Mega nodes. Cloud GPU users on the vLLM inference engine can likewise deploy DeepSeek models on A3 Mega (recipe) and A3 Ultra (recipe) VMs.
MaxText now also includes support for DeepSeek architectural innovations such as multi-head latent attention (MLA), MoE shared and routed experts with loss-free load balancing, dropless expert parallelism, mixed dense and MoE decoder layers, and YaRN RoPE embeddings. These reference implementations for the DeepSeek family of models let you rapidly experiment with your own models by incorporating these newer architectural enhancements.
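For intuition, here is an illustrative JAX sketch of top-k routed experts, the core MoE mechanism referenced above. Shared experts, loss-free load balancing, and dropless dispatch are omitted, and every shape here is an arbitrary assumption rather than the MaxText implementation.

```python
# An illustrative JAX sketch of top-k routed experts (the MoE building block).
# Not the MaxText implementation; all shapes are arbitrary assumptions.
import jax
import jax.numpy as jnp

def moe_layer(x, gate_w, expert_w, k=2):
    """x: [tokens, d]; gate_w: [d, n_experts]; expert_w: [n_experts, d, d]."""
    logits = x @ gate_w                         # [tokens, n_experts] router scores
    weights, idx = jax.lax.top_k(logits, k)     # each token picks its k best experts
    weights = jax.nn.softmax(weights, axis=-1)  # normalize over the chosen experts
    # Gather each token's chosen expert matrices ([tokens, k, d, d]) and apply them.
    expert_out = jnp.einsum("td,tkdh->tkh", x, expert_w[idx])
    return jnp.einsum("tkh,tk->th", expert_out, weights)  # weighted combine

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4, 16))          # 4 tokens, d=16
gate_w = jax.random.normal(key, (16, 8))     # 8 experts
expert_w = jax.random.normal(key, (8, 16, 16))
print(moe_layer(x, gate_w, expert_w).shape)  # (4, 16)
```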
The reproducible recipes show the steps to deploy and benchmark inference with the new Llama 4 and DeepSeek models. For example, this TPU recipe outlines the steps to deploy the Llama-4-Scout-17B-16E model with the JetStream MaxText engine on Trillium TPUs: provision the TPU cluster, download the model weights, and set up JetStream and MaxText. It then shows you how to convert the checkpoint to a MaxText-compatible format, deploy it on a JetStream server, and run your benchmarks.
A typical recipe outline:
Download the model weights from Hugging Face
Convert the checkpoint from Hugging Face format to JAX Orbax format
Unscan checkpoint for performant serving
Deploy JetStream and Pathways (for multi-host serving)
Bring up the Llama 4 server with the JetStream engine using the appropriate server config (a sketch follows this list)
Run benchmarks against this server, e.g., MMLU via the JetStream benchmarking script (a sketch follows this list)
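Here is a condensed sketch of those last two steps. The server is launched with MaxText’s usual key=value overrides, but the model name, paths, and benchmark flags are illustrative assumptions; the recipe itself has the exact commands.

```python
# A condensed, illustrative sketch of the final two recipe steps. Model name,
# paths, and benchmark flags are assumptions; see the recipe for exact commands.
import subprocess

# Bring up the JetStream server (scan_layers=false matches the unscanned
# checkpoint produced earlier). Popen keeps it running in the background;
# in practice you would wait for its "ready" log line before benchmarking.
server = subprocess.Popen([
    "python3", "MaxText/maxengine_server.py", "MaxText/configs/base.yml",
    "model_name=llama4-17b-16e",                                    # assumed identifier
    "tokenizer_path=assets/tokenizer_llama4",                       # placeholder path
    "load_parameters_path=gs://YOUR_BUCKET/llama4/unscanned/ckpt",  # placeholder path
    "max_prefill_predict_length=1024",
    "max_target_length=2048",
    "per_device_batch_size=1",
    "scan_layers=false",
])

# Then run the JetStream benchmarking script against it, e.g. for MMLU
# (the dataset flag and values are assumptions; the recipe has the exact form).
subprocess.run([
    "python3", "JetStream/benchmarks/benchmark_serving.py",
    "--dataset", "mmlu",
    "--num-prompts", "1000",
    "--max-output-length", "1024",
], check=True)
server.terminate()
```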
You can deploy the Llama 4 Scout and Maverick models or the DeepSeek-V3/R1 models today using the inference recipes from the AI Hypercomputer GitHub repository. These recipes provide a starting point for deploying and experimenting with these models on Google Cloud. Explore the recipes and resources linked below, and stay tuned for future updates. We hope you have fun building, and please share your feedback!
When you deploy open models like DeepSeek and Llama, you are responsible for their security and legal compliance. Follow responsible AI best practices, adhere to each model’s specific licensing terms, and ensure your deployment is secure and compliant with all regulations in your area.
| Model | Accelerator | Framework | Inference recipe link |
| --- | --- | --- | --- |
| Llama-4-Scout-17B-16E | Trillium (TPU v6e) | JetStream MaxText | |
| Llama-4-Maverick-17B-128E | Trillium (TPU v6e) | JetStream MaxText + Pathways on Cloud | |
| Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct | A3 Ultra (8xH200) | vLLM | |
| Llama-4-Scout-17B-16E-Instruct, Llama-4-Maverick-17B-128E-Instruct | A3 Mega (8xH100) | vLLM | |
| Model | Accelerator | Framework | Inference recipe link |
| --- | --- | --- | --- |
| DeepSeek-R1-Distill-Llama-70B | Trillium (TPU v6e) | JetStream MaxText | |
| DeepSeek-R1-Distill-Llama-70B | Trillium (TPU v6e) | vLLM | |
| DeepSeek-R1 671B | A3 Ultra (8xH200) | vLLM | |
| DeepSeek-R1 671B | A3 Ultra (8xH200) | SGLang | |
| DeepSeek-R1 671B | A3 Mega (16xH100) | vLLM | |
| DeepSeek-R1 671B | A3 Mega (16xH100) | SGLang | |