AI Hypercomputer is a fully integrated supercomputing architecture for AI workloads – and it’s easier to use than you think. In this blog, we break down four common use cases, including reference architectures and tutorials, representing just a few of the many ways you can use AI Hypercomputer today.
Short on time? Here’s a quick summary.
Affordable inference. JAX, Google Kubernetes Engine (GKE) and NVIDIA Triton Inference Server are a winning combination, especially when you pair them with Spot VMs for up to 90% cost savings. We have several tutorials, like this one on how to serve LLMs like Llama 3.1 405B on GKE.
Large and ultra-low latency training clusters. Hypercompute Cluster gives you physically co-located accelerators, targeted workload placement, advanced maintenance controls to minimize workload disruption, and topology-aware scheduling. You can get started by creating a cluster with GKE or try this pretraining NVIDIA GPU recipe.
High-reliability inference. Pair new cloud load balancing capabilities like custom metrics and service extensions with GKE Autopilot, which includes features like node auto-repair to automatically replace unhealthy nodes, and horizontal pod autoscaling to adjust resources based on application demand.
Easy cluster setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You can get started with one of our AI/ML blueprints.
If you want to see a broader set of reference implementations, benchmarks and recipes, go to the AI Hypercomputer GitHub.
Why it matters
Deploying and managing AI applications is tough. You need to choose the right infrastructure, control costs, and reduce delivery bottlenecks. AI Hypercomputer helps you deploy AI applications quickly, easily, and more efficiently than simply buying raw hardware and chips.
Take Moloco, for example. Using the AI Hypercomputer architecture, Moloco achieved 10x faster model training times and reduced costs by 2-4x.
Let’s dive deeper into each use case.
High-reliability inference
According to Futurum, in 2023 Google had ~3x fewer outage hours than Azure and ~3x fewer than AWS. Those numbers fluctuate over time, but maintaining high availability is a challenge for everyone. The AI Hypercomputer architecture offers fully integrated capabilities for high-reliability inference.
Many customers start with GKE Autopilot because of its 99.95% pod-level uptime SLA. Autopilot enhances reliability by automatically managing nodes (provisioning, scaling, upgrades, repairs) and applying security best practices, freeing you from manual infrastructure tasks. This automation, combined with resource optimization and integrated monitoring, minimizes downtime and helps your applications run smoothly and securely.
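Autopilot handles the node layer; the serving Deployment itself typically scales with a Kubernetes HorizontalPodAutoscaler, the horizontal pod autoscaling mentioned in the summary above. Here is a minimal sketch using the official Kubernetes Python client; the deployment name and the 60% CPU target are illustrative placeholders, not values from the reference architecture.

```python
# Minimal sketch: create a HorizontalPodAutoscaler with the Kubernetes Python
# client. The deployment name "jetstream-server" and the 60% CPU target are
# illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

hpa = client.V2HorizontalPodAutoscaler(
    api_version="autoscaling/v2",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="jetstream-server-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="jetstream-server"
        ),
        min_replicas=2,
        max_replicas=10,
        metrics=[
            client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(
                        type="Utilization", average_utilization=60
                    ),
                ),
            )
        ],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```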
There are several configurations available, but in this reference architecture we use TPUs with the JetStream Engine to accelerate inference, plus JAX, GCS Fuse, and SSDs (like Hyperdisk ML) to speed up the loading of model weights. Two notable additions to the stack get us to high reliability: Service Extensions and custom metrics.
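Before getting to those two additions, here is roughly what the TPU serving side looks like in practice: mount the checkpoint over GCS Fuse or Hyperdisk ML, load the weights, and run a jit-compiled forward pass. This is a minimal sketch; the mount path and the toy single-layer "model" are stand-ins for a real checkpoint and a real LLM.

```python
# Minimal sketch: load weights from a GCS Fuse (or Hyperdisk ML) mount and run
# a jit-compiled forward pass on TPU. The mount path and the toy single-layer
# "model" are illustrative stand-ins.
import numpy as np
import jax
import jax.numpy as jnp

print(jax.devices())  # should list TPU devices on a GKE TPU node pool

# GCS Fuse exposes the bucket as an ordinary filesystem path inside the pod.
weights = np.load("/gcs/model/params.npz")
w, b = jnp.asarray(weights["w"]), jnp.asarray(weights["b"])

@jax.jit
def forward(x):
    # Stand-in for a real model's forward pass.
    return jnp.dot(x, w) + b

x = jnp.ones((8, w.shape[0]))
print(forward(x).shape)
```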
Service extensions allow you to customize the behavior of Cloud Load Balancer by inserting your own code (written as plugins) into the data path, enabling advanced traffic management and manipulation.
Custom metrics, utilizing the Open Request Cost Aggregation (ORCA) protocol, allow applications to send workload-specific performance data (like model serving latency) to Cloud Load Balancer, which then uses this information to make intelligent routing and scaling decisions.
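To make that concrete, here is a minimal sketch of how a serving container might report a custom metric alongside each response. It is an illustrative Flask handler; the endpoint-load-metrics response header and the ORCA text format shown are assumptions drawn from the custom metrics documentation, and the metric name model_latency_ms is made up for the example.

```python
# Minimal sketch (assumption-laden): a serving endpoint that reports a custom
# metric to the load balancer via an ORCA-style response header. The header
# name, text format, and metric name are illustrative; check the custom
# metrics documentation for the exact contract.
import time
from flask import Flask, jsonify

app = Flask(__name__)

def run_model(prompt: str) -> dict:
    # Placeholder for the real model call (e.g., a JetStream or Triton request).
    return {"prompt": prompt, "completion": "..."}

@app.route("/generate", methods=["POST"])
def generate():
    start = time.monotonic()
    result = run_model("hello")
    latency_ms = (time.monotonic() - start) * 1000.0

    resp = jsonify(result)
    # Report per-request load so the load balancer can route and scale on it.
    resp.headers["endpoint-load-metrics"] = (
        f"TEXT named_metrics.model_latency_ms={latency_ms:.1f}"
    )
    return resp

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```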
Try it yourself. Start by defining your load balancing custom metrics, create a plugin using Service Extensions, or spin up a fully managed Kubernetes cluster with Autopilot. For more ideas, check out this blog on the latest networking enhancements for generative AI applications.
Large and ultra-low latency training clusters
Training large AI models demands massive, efficiently scaled compute. Hypercompute Cluster is a supercomputing solution built on AI Hypercomputer that lets you deploy and manage a large number of accelerators as a single unit, using a single API call. Here are a few things that set Hypercompute Cluster apart:
Clusters are densely physically co-located for ultra-low-latency networking. They come with pre-configured and validated templates for reliable and repeatable deployments, and with cluster-level observability, health monitoring, and diagnostic tooling.
To simplify management, Hypercompute Clusters are designed to integrate with orchestrators like GKE and Slurm, and are deployed via the Cluster Toolkit. GKE provides support for over 50,000 TPU chips to train a single ML model.
In this reference architecture, we use GKE Autopilot and A3 Ultra VMs.
GKE supports up to 65,000 nodes in a single cluster, which we believe is more than 10x the scale offered by the other two largest public cloud providers.
A3 Ultra VMs use NVIDIA H200 GPUs with twice the GPU-to-GPU network bandwidth and twice the high-bandwidth memory (HBM) of A3 Mega VMs. They are built with our new Titanium ML network adapter and incorporate NVIDIA ConnectX-7 network interface cards (NICs) to deliver a secure, high-performance cloud experience, perfect for large multi-node GPU workloads.
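Once a cluster like this is up, a training workload reaches the GPUs through an ordinary Kubernetes resource request. Here is a minimal sketch using the Kubernetes Python client that submits a single-node job asking for all eight GPUs on an A3 Ultra machine; the image, command, and job name are placeholders.

```python
# Minimal sketch: submit a Kubernetes Job that requests 8 GPUs on an A3-class
# node. The image, command, and job name are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="trainer",
    image="us-docker.pkg.dev/my-project/my-repo/trainer:latest",  # placeholder
    command=["python", "train.py"],                                # placeholder
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "8"}  # one full A3 Ultra node's worth of GPUs
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="pretrain-smoke-test"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```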
Try it yourself: Create a Hypercompute Cluster with GKE or try this pretraining NVIDIA GPU recipe.
Affordable inference
Serving AI, especially large language models (LLMs), can become prohibitively expensive. AI Hypercomputer combines open software, flexible consumption models and a wide range of specialized hardware to minimize costs.
Cost savings are everywhere, if you know where to look. Beyond the tutorials, there are two cost-efficient deployment models you should know. GKE Autopilot reduces the cost of running containers by up to 40% compared to standard GKE by automatically scaling resources based on actual needs, while Spot VMs can save up to 90% on batch or fault-tolerant jobs. You can combine the two to save even more — “Spot Pods” are available in GKE Autopilot to do just that.
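Requesting Spot Pods in Autopilot comes down to a node selector on the workload, and GKE then schedules those Pods onto Spot capacity. Here is a minimal sketch with the Kubernetes Python client; the deployment name and image are placeholders, and you should make sure the workload tolerates preemption before running it on Spot.

```python
# Minimal sketch: a Deployment whose Pods request Spot capacity in Autopilot
# via the cloud.google.com/gke-spot node selector. Name and image are
# illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

pod_spec = client.V1PodSpec(
    node_selector={"cloud.google.com/gke-spot": "true"},  # request Spot Pods
    containers=[
        client.V1Container(
            name="batch-inference",
            image="us-docker.pkg.dev/my-project/my-repo/inference:latest",  # placeholder
        )
    ],
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="spot-inference"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "spot-inference"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "spot-inference"}),
            spec=pod_spec,
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```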
In this reference architecture, after training with JAX, we convert the model into NVIDIA's FasterTransformer format for inference. The optimized model is served via NVIDIA Triton on GKE Autopilot. Triton's multi-model support allows for easy adaptation to evolving model architectures, and a pre-built NeMo container simplifies setup.
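Once Triton has loaded the converted model, clients query it over HTTP or gRPC. Here is a minimal sketch using NVIDIA's tritonclient package; the server URL, model name, and tensor names and shapes are illustrative assumptions that depend on how the FasterTransformer model was exported.

```python
# Minimal sketch: query a Triton Inference Server over HTTP. The server URL,
# model name, and tensor names/shapes are illustrative assumptions that depend
# on the exported model configuration.
import numpy as np
import tritonclient.http as httpclient

triton = httpclient.InferenceServerClient(url="triton.example.internal:8000")

input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int32)
infer_input = httpclient.InferInput("input_ids", input_ids.shape, "INT32")
infer_input.set_data_from_numpy(input_ids)

response = triton.infer(model_name="llm_ft", inputs=[infer_input])
print(response.as_numpy("output_ids"))
```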
Try it yourself: Start by learning how to serve a model with a single NVIDIA GPU in GKE. You can also serve Gemma open models with Hugging Face TGI, or LLMs like DeepSeek-R1 671B and Llama 3.1 405B.
Easy cluster setup
You need tools that simplify, not complicate, your infrastructure setup. The open-source Cluster Toolkit offers pre-built blueprints and modules for rapid, repeatable cluster deployments. You get easy integration with JAX, PyTorch, and Keras. Platform teams get simplified management with Slurm, GKE, and Google Batch, plus flexible consumption models like Dynamic Workload Scheduler and a wide range of hardware options. In this reference architecture, we set up an A3 Ultra cluster with Slurm.
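Once the Slurm cluster is running, a JAX training script can pick up its multi-host settings straight from the Slurm environment. Here is a minimal sketch, assuming the script is launched across the allocation with srun and that your JAX version supports Slurm auto-detection in jax.distributed.initialize().

```python
# Minimal sketch: initialize multi-host JAX inside a Slurm allocation. When
# launched with srun across nodes, jax.distributed.initialize() can detect the
# coordinator address and process count from Slurm environment variables.
import jax

jax.distributed.initialize()  # auto-detects Slurm-provided settings

print(
    f"process {jax.process_index()} of {jax.process_count()}, "
    f"local devices: {jax.local_device_count()}, "
    f"global devices: {jax.device_count()}"
)
```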
Try it yourself. You can select one of our easy-to-use AI/ML blueprints, available through our GitHub repo, and use it to set up a cluster. We also offer a variety of resources to help you get started, including documentation, quickstarts, and videos.