image1 BoPNyGA.max 1000x1000 1
When it comes to AI, large language models (LLMs) and machine learning (ML) are taking entire industries to the next level. But with larger models and datasets, developers need distributed environments that span multiple AI accelerators (e.g. GPUs and TPUs) across multiple compute hosts to train their models efficiently. This can lead to orchestration, resource management, and scalability challenges.
We’re here to help. At Google Cloud, we provide a robust suite of GPU and TPU resources alongside advanced orchestration tools as part of AI Hypercomputer architecture to simplify distributed, large-scale training. In this blog, we’ll guide you through the orchestration tools available for GPU accelerators on Google Cloud that can help you streamline and scale your machine learning workflows.
A key element of distributed training lies in selecting the right GPU. Google Cloud’s specialized machine families offer tailored solutions for varying needs of performance and cost efficiency. The A3 machine series, featuring NVIDIA H100 and NVIDIA H200 (upcoming) GPUs, delivers strong GPU-to-GPU bandwidth that’s a great fit for large-scale training workloads. In contrast, the A2 machine series with NVIDIA A100 GPUs is designed for scenarios that require minimal inter-node communication such as streamlined, single-node training. Additionally, the versatile G2 machine family, equipped with NVIDIA L4 GPUs, provides the flexibility necessary for inference and testing workloads.
We also offer multiple GPU consumption models to meet the needs of large-scale training:
Committed Use Discounts (CUDs) provide significant cost savings and guaranteed capacity in return for a long-term commitment.
Dynamic Workload Scheduler (DWS) comes in two modes, which are designed for various workloads that need assurance or can be flexible about start time; the capacity is available for a defined duration and offered at a lower list price.
On-demand consumption is the most flexible, with no upfront commitments, although the capacity availability is not guaranteed.
Spot VMs provide drastically lower costs but are preemptible, requiring resilient and disruption-tolerant job designs.
To further accelerate your distributed training, we’ll explore three powerful orchestration strategies on Google Cloud: Google Kubernetes Engine (GKE), Cluster Toolkit, and Vertex AI custom training pipeline. Each approach brings its unique strengths, enabling you to leverage Google Cloud’s powerful infrastructure to drive your machine learning projects forward quickly and scalably.
Let’s walk through each of the options to better understand how Google Cloud’s advanced orchestration tools can help you optimize resources, reduce complexity, and achieve strong performance in your ML initiatives.
Enterprises with robust platform teams often want a unified environment on which to run all their workloads, including custom training, for simpler management. GKE is a good choice in this context, providing the flexibility and scalability to handle diverse workloads on a single platform. With GKE, platform teams gain centralized control and visibility, while optimizing resource utilization and streamlining management.
Here’s how to orchestrate ML workloads running on GKE:
1. GKE cluster and nodepool provisioning
If you have reservation (CUD or DWS calendar) and prefer to use Terraform, follow the instructions from cluster provisioning templates, and specify the parameter file (terraform.tfvars):
Then execute the following command to provision the GKE cluster and nodepool:
In addition, Cluster Toolkit includes terraform based example blueprints to provision A3 or A3 Mega GKE clusters and nodepool.
If you prefer to use the gcloud command, follow the step-by-step instructions from this tutorial to create a GKE cluster and nodepool with A3/A3 Mega VMs. 
For DWS Flex, you can create a DWS enabled node-pool with these gcloud commands:
2. Enable GPU direct communication with A3 TCPX/A3 Mega TCPXO and perform an initial benchmark test
Follow these steps to install GPUDirect for TCPX/TCPXO libraries, configure NCCL, and deploy a test workload to perform your initial benchmark tests.
Validate the output of allgather benchmark tests for two A3 Mega nodes:
        
In the above benchmark output table, the first column is message size, while the algbw and busbw columns on the right indicate per GPU bandwidth. Usually, we use the in/out-of-place busbw column with the biggest message size (highlighted row) to determine cross-node bandwidth. For A3 Mega nodes, we expect a range of 185-190GB/s per GPU; this may indicate near cross-node 1600gbps network bandwidth for A3 Mega nodes with 8 NVIDIA H100 GPUs and 8 NICs.
You may expand the NCCL tests from two nodes to 8, 16, 32, etc. to ensure your cross-node network performance is within a decent range and that all the nodes are healthy.
3. Configure distributed training batch workload
You can use JobSet, a Kubernetes-native API for managing a group of k8s Jobs as a unit using a unified API, to deploy distributed HPC (e.g., MPI) and AI/ML training workloads (PyTorch, Jax, Tensorflow etc.) on Kubernetes.
The following example illustrates a JobSet yaml manifest for A3 with GPUDirect-TCPX, which includes:
Key JobSet configuration elements
b. Training job settings, including pytorch main container
c. gcsfuse, tcpx (A3 high), tcpxo (A3 Mega) RxDM container
d. NCCL environment variables
For DWS batch workloads, please refer to the following A3 Mega-based example, integrating Kueue and JobSet settings.
Lastly, refer to this Helmchart example to see how to perform Megatron LM (Llama2) training on A3 Mega.
Slurm is one of the most popular high-performance computing (HPC) job schedulers. Used by researchers in both academia and industry, it offers a robust solution for LLM training orchestration with familiar semantics. Support for Slurm on Google Cloud is provided by Cluster Toolkit, formerly known as Cloud HPC Toolkit, open-source software that simplifies the process of deploying HPC, AI, and ML workloads on Google Cloud. It is designed to be highly customizable and extensible, and to address the deployment needs of a broad range of use cases, including deploying infrastructure for large-scale LLM training.
1. Provisioning A3-high and A3-mega clusters
Install Cluster Toolkit using the configuration instructions in the public documentation. Be sure to note some of the prerequisites including supported versions of Go, Terraform, and Packer.
Once you have a working Cluster Toolkit installation including the downloaded github repository, navigate to the examples/machine-learning blueprints directory. Here, you will have two folders for deploying H100 clusters based on the A3-series machine shapes, a3-highgpu-8g and a3-megagpu-8g. In this example, we’ll explore the blueprint in the a3-megagpu-8g folder.
Google Cloud Cluster Toolkit blueprints are Infrastructure as Code (IaC) documents that describe the infrastructure you would like to deploy, and are conceptually similar to Terraform or other IaC tooling. For the a3-megagpu-8g blueprint, there are three main files that control the deployment: 
slurm-a3mega-base.yaml – includes creating the necessary VPC networks along with the filestore instance used for a common home filesystem on the cluster nodes.slurm-a3mega-image.yaml – creates the Compute Engine image instance that is used by Slurm to provision nodes based on the cluster’s definitioslurm-a3mega-cluster.yaml – sets up the main cluster components, including the Slurm controller (the main orchestrator for Slurm jobs), the Slurm login node (a host used for job submission) and the a3mega partition (the working nodes in the cluster)While you can customize each of the blueprint components if needed, you can easily get started by simply specifying the details for your working environment in the deployment-base.yaml and the deployment-image-cluster.yaml.
2. Enable GPUDirect-TCPXO optimized NCCL communication
Once the Slurm cluster is created, follow this tutorial to enable GPUDirect-TCPXO for optimized NCCL communication on the GPU networks. To validate the environment and ensure the TCPXO plugin is being properly loaded, build and compile the NCCL tests. Then, run sbatch run-nccl-tests.sh from the login node, being sure to change the number of nodes in the script to match those in your cluster. This runs a distributed all_gather test across the GPUs and nodes indicated in the script.
When working as intended, the NCCL tests should show output results indicating high-speed bandwidth throughput at various message sizes. A common measure of performance is to use the busbw value in GB/s from the second or last row of the output table, which shows the 4Gb and 8Gb message size values. A cluster with TCPXO active should report around 190 GB/s busbw throughput. See the performance page in the NVIDIA NCCL-tests repository for more details around these metrics.
3. Run an NeMo training workload
Follow this NeMo training tutorial to run an example NeMo Framework Pre-Training job using the following steps:
Step 1:
Creates a NeMo Framework-derived container with the necessary TCPXO environment variables
Submits a Slurm job to copy the framework launcher scripts and a few other auxiliary files into your working directory
Step 2:
Establishes a Python virtual environment and installs NeMo Framework python package dependencies
Step 3:
This command runs distributed training of a 5B parameter GPT3 model across eight nodes for 10 steps using mock data as the input.
For teams seeking managed infrastructure experience as well as access to leading open models such as Llama 3.1, Mistral, etc., Vertex AI Model Garden and Custom Training Job service presents an attractive option. This fully managed solution removes most of the orchestration burden and provides end-to-end ML platform operations, allowing you to focus on model development and experimentation. Vertex AI’s end-to-end training support further simplifies the process, offering an integrated workflow from data preparation to deployment.
Let’s look at how to perform single-node or multi-node fine-tuning/training workload on Vertex.
Single-node multi-GPU fine-tuning/training on Vertex
This notebook demonstrates fine-tuning and deploying Llama 3.1 models with the Vertex AI SDK. All of the examples in this notebook use parameter-efficient finetuning (PEFT) methods with Low-Rank Adaptation (LoRA) to reduce training and storage costs. LoRA is one approach of PEFT, where pretrained model weights are frozen and rank decomposition matrices representing the change in model weights are trained during fine-tuning.
This Vertex sample training repo provides examples on how to launch multi-node distributed training on A3 Mega (8 x NVIDIA H100) on Vertex.
The NeMo example illustrates how to perform pre-training, continued pre-training and supervised fine-tuning (SFT). In addition, NeMo allows optimized training as a popular approach to evaluate the AI accelerator (A3 Mega in this case). To benchmark, you can rely on the reported metrics such as epoch time, step-time, etc. Since NeMo runs on most NVIDIA GPU types, it can be helpful for comparing different AI chips for a given task. Read on to learn how to run the example on Vertex with A3 Mega node types.
launch.sh is the main entry point to launch NeMo distributed training with command parameters:
Example:
At the end of launch.sh script, we use curl command to call Vertex customJobs API to launch NeMo training job in Vertex:
Job configurations in vertex-payload.json are part of curl command to launch Nemo training, it includes job specifications on resource requirements as showed:
The job configuration arguments “${TRANSFER_MODEL_CMD} ${LAUNCH_CMD}” in turn embed full content from the job training script, which also includes all the NCCL environments required by A3 Mega, while other pytorch launch commands are executed by Vertex CustomJob.
Optionally, build your own custom job container image as an “imageUri” parameter in vertex-payload.json, using this Dockerfile as your reference.
Lastly, we recognize many organizations prefer a more hands-on approach and have specific orchestration tools or frameworks that they wish to use. If that describes you, Google Compute Engine provides the foundation to build your own tailored training environments, letting you create and configure virtual machines (VMs) with your desired specifications, including the type and number of GPUs, CPU, memory, and storage. This granular control lets you optimize your infrastructure for your specific training workloads and integrate your preferred orchestration tools.
To facilitate this process, we provide example code snippets demonstrating how to use the gcloud compute instance create and gcloud compute instance bulk create API calls to create and manage your vanilla A3 Mega instances. Whether you need to create a single VM or provision a large-scale cluster, these resources can help streamline your infrastructure setup.
With the right orchestration strategy and Google Cloud’s robust and leading AI infrastructure, you can achieve your training goals and transform your business objectives into reality.
To learn more about distributed training, please review GKE example, Cluster Toolkit example, and Vertex AI example.
📦 : https://github.com/lovisdotio/workflow-magnify-upscale-video-comfyui-lovis I did this ComfyUI workflow for Sora 2 upscaling 🚀 ( or…
Python's flexibility with data types is convenient when coding, but it can lead to runtime…
We revisit scene-level 3D object detection as the output of an object-centric framework capable of…
Editor’s Note: This is the second in a two-part series highlighting demo sessions from AIPCon…
Generative AI has emerged as a transformative technology in healthcare, driving digital transformation in essential…
The conversation around generative AI in the enterprise is getting creative. Since launching our popular…