1 lTFF8Zumax 1000x1000 1
As generative AI becomes more widespread, it’s important for developers and ML engineers to be able to easily configure infrastructure that supports efficient AI inference, i.e., using a trained AI model to make predictions or decisions based on new, unseen data. While great at training models, traditional GPU-based serving architectures struggle with the “multi-turn” nature of inference, characterized by back-and-forth conversations where the model must maintain context and understand user intent. Further, deploying large generative AI models can be both complex and resource-intensive.
At Google Cloud, we’re committed to providing customers with the best choices for their AI needs. That’s why we are excited to announce a new recipe for disaggregated inferencing with NVIDIA Dynamo, a high-performance, low-latency platform for a variety of AI models. Disaggregated inference separates out model processing phases, offering a significant leap in performance and cost-efficiency.
Specifically, this recipe makes it easy to deploy NVIDIA Dynamo on Google Cloud’s AI Hypercomputer, including Google Kubernetes Engine (GKE), vLLM inference engine, and A3 Ultra GPU-accelerated instances powered by NVIDIA H200 GPUs. By running the recipe on Google Cloud, you can achieve higher performance and greater inference efficiency while meeting your AI applications’ latency requirements. You can find this recipe, along with other resources, in our growing AI Hypercomputer resources repository on GitHub.
Let’s take a look at how to deploy it.
LLM inference is not a monolithic task; it’s a tale of two distinct computational phases. First is the prefill (or context) phase, where the input prompt is processed. Because this stage is compute-bound, it benefits from access to massive parallel processing power. Following prefill is the decode (or generation) phase, which generates a response, token by token, in an autoregressive loop. This stage is bound by memory bandwidth, requiring extremely fast access to the model’s weights and the KV cache.
In traditional architectures, these two phases run on the same GPU, creating resource contention. A long, compute-heavy prefill can block the rapid, iterative decode steps, leading to poor GPU utilization, higher inference costs, and increased latency for all users.
Our new solution tackles this challenge head-on by disaggregating, or physically separating, the prefill and decode stages across distinct, independently managed GPU pools.
Here’s how the components work in concert:
A3 Ultra instances and GKE: The recipe uses GKE to orchestrate separate node pools of A3 Ultra instances, powered by NVIDIA H200 GPUs. This creates specialized resource pools — one optimized for compute-heavy prefill tasks and another for memory-bound decode tasks.
NVIDIA Dynamo: Acting as the inference server, NVIDIA Dynamo’s modular front end and KV cache-aware router processes incoming requests. It then pairs GPUs from the prefill and decode GKE node pools and orchestrates workload execution between them, transferring the KV cache that’s generated in the prefill pool to the decode pool to begin token generation.
vLLM: Running on pods within each GKE pool, the vLLM inference engine helps ensure best-in-class performance for the actual computation, using innovations like PagedAttention to maximize throughput on each individual node.
This disaggregated approach allows each phase to scale independently based on real-time demand, helping to ensure that compute-intensive prompt processing doesn’t interfere with fast token generation. Dynamo supports popular inference engines including SGLang, TensorRT-LLM and vLLM. The result is a dramatic boost in overall throughput and maximized utilization of every GPU.
        
The reproducible recipe shows the steps to deploy disaggregated inference with NVIDIA Dynamo on the A3 Ultra (H200) VMs on Google Cloud using GKE for orchestration and vLLM as the inference engine. The single node recipe demonstrates disaggregated inference with one node of A3 Ultra using four GPUs for prefill and four GPUs for decode. The multi-node recipe demonstrates disaggregated inference with one node of A3 Ultra for prefill and one node of A3 Ultra for decode for the Llama-3.3-70B-Instruct Model.
Future recipes will provide support for additional NVIDIA GPUs (e.g. A4, A4X) and inference engines with expanded coverage of models.
The recipe highlights the following key steps:
Perform initial setup – This sets up environment variables and secrets; this needs to be done one-time only.
Install Dynamo Platform and CRDs – This sets up the various Dynamo Kubernetes components; this needs to be done one-time only.
Deploy inference backend for a specific model workload – This deploys vLLM/SGLang as the inference backend for Dynamo disaggregated inference for a specific model workload. Repeat this step for every new model inference workload deployment.
Process inference requests – Once the model is deployed for inference, incoming queries are processed to provide responses to users.
Once the server is up, you will see the prefill and decode workers along with the frontend pod which acts as the primary interface to serve the requests.
        
We can verify if everything works as intended by sending a request to the server like this. The response is generated and truncated to max_tokens.
By moving beyond the constraints of traditional serving, the new disaggregated inference recipe represents the future of efficient, scalable LLM inference. It enables you to right-size resources for each specific task, unlocking new performance paradigms and significant cost savings for your most demanding generative AI applications. We are excited to see how you will leverage this recipe to build the next wave of AI-powered services. We encourage you to try out our Dynamo Disaggregated Inference Recipe which provides a starting point with recommended configurations and easy steps. We hope you have fun experimenting and share your feedback!
submitted by /u/nikitagent [link] [comments]
A dangerous assumption that can be made from prior work on the bias transfer hypothesis…
Author: Keertana Chidambaram, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya(*The work was done when Keertana interned…
Remember when browsers were simple? You clicked a link, a page loaded, maybe you filled…
These affordable open buds come with Bose-crafted sound.
Over the past decade, deep learning has transformed how artificial intelligence (AI) agents perceive and…