
What’s new with HPC and AI infrastructure at Google Cloud

At Google Cloud, we’re rapidly advancing our high-performance computing (HPC) capabilities, providing researchers and engineers with powerful tools and infrastructure to tackle the most demanding computational challenges. Here’s a look at some of the key developments driving HPC innovation on Google Cloud, as well as our presence at Supercomputing 2024.

You can also stay apprised of our HPC and AI advances by joining the new Google Cloud Advanced Computing Community (details below). 

Next-generation HPC VMs

We began our H-series with H3 VMs, specifically designed to meet the needs of demanding HPC workloads. Now, we’re excited to share some key features of the next generation of the H family, bringing even more innovation and performance to the table. The upcoming VMs will feature:

  • Improved workload scalability via RDMA-enabled 200 Gbps networking

  • Native support to directly provision full, tightly-coupled HPC clusters on demand 

  • Dynamic Workload Scheduler to provision fixed-lifetime clusters now or in the future

  • Titanium technology that delivers superior performance, reliability, and security 

We provide system blueprints for setting up turnkey, pre-configured HPC clusters on our H series VMs.

The next generation of the H series is coming in early 2025.


Parallelstore: World’s first fully-managed DAOS offering

Parallelstore is a fully managed, scalable, high-performance storage solution based on next-generation DAOS technology, designed for demanding HPC and AI workloads. It is now generally available and provides:

  • Up to 6x greater read throughput performance compared to competitive Lustre scratch offerings

  • Low latency (<0.5ms at p50) and high throughput (>1GiB/s per TiB) to access data with minimal delays, even at massive scale

  • High IOPS (30K IOPS per TiB) for metadata operations

  • Simplified management that reduces operational overhead with a fully managed service  
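Because the throughput and IOPS figures above are quoted per TiB, aggregate performance scales with provisioned capacity. The following Python sketch turns those per-TiB numbers into a rough capacity-based estimate; it is illustrative arithmetic using the figures quoted in this post, not an official sizing tool, and it assumes linear scaling.

```python
# Illustrative sizing arithmetic based on the per-TiB figures quoted above.
# Assumes linear per-TiB scaling; this is not an official sizing tool.

THROUGHPUT_GIBS_PER_TIB = 1.0   # >1 GiB/s read throughput per TiB
IOPS_PER_TIB = 30_000           # 30K metadata IOPS per TiB

def estimate_parallelstore(capacity_tib: float) -> dict:
    """Estimate aggregate performance for an instance of the given
    capacity, assuming linear per-TiB scaling of the quoted figures."""
    return {
        "read_throughput_gibs": capacity_tib * THROUGHPUT_GIBS_PER_TIB,
        "metadata_iops": int(capacity_tib * IOPS_PER_TIB),
    }

# Example: a 100 TiB instance
print(estimate_parallelstore(100))
```

At 100 TiB, this works out to roughly 100 GiB/s of aggregate read throughput and 3 million metadata IOPS, on top of the sub-millisecond p50 latency quoted above.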

Parallelstore is great for applications requiring fast access to large datasets, such as:

  • Analyzing massive genomic datasets for personalized medicine

  • Training large language models (LLMs) and other AI applications efficiently  

  • Running complex HPC simulations with rapid data access

A3 Ultra VMs with NVIDIA H200 Tensor Core GPUs

For GPU-based HPC workloads, we recently announced A3 Ultra VMs, which feature NVIDIA H200 Tensor Core GPUs. A3 Ultra VMs offer a significant leap in performance over previous generations. They are built on servers with our new Titanium ML network adapter, optimized to deliver a secure, high-performance cloud experience for AI workloads, and powered by NVIDIA ConnectX-7 networking. Combined with our datacenter-wide 4-way rail-aligned network, A3 Ultra VMs deliver non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). 

Compared with A3 Mega, A3 Ultra offers: 

  • 2x the GPU-to-GPU networking bandwidth, powered by Google Cloud’s Titanium ML network adapter and backed by our Jupiter data center network

  • Up to 2x higher LLM inferencing performance with nearly double the memory capacity and 1.4x more memory bandwidth

  • Ability to scale to tens of thousands of GPUs in a dense, performance-optimized cluster for large AI and HPC workloads
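To put the networking numbers in perspective, here is a back-of-the-envelope sketch in Python. The 3.2 Tbps non-blocking GPU-to-GPU figure is per VM; the 8-GPUs-per-VM count is our assumption (typical for A3-family VMs, not stated in this post), and the A3 Mega figure is simply derived from the "2x" comparison above.

```python
# Back-of-the-envelope bandwidth arithmetic for A3 Ultra networking.
# The 8-GPU-per-VM count is an assumption (typical for A3-family VMs);
# the post itself only quotes the per-VM aggregate.

A3_ULTRA_VM_GBPS = 3200      # 3.2 Tbps non-blocking GPU-to-GPU RoCE
GPUS_PER_VM = 8              # assumed H200 GPUs per A3 Ultra VM

per_gpu_gbps = A3_ULTRA_VM_GBPS / GPUS_PER_VM
a3_mega_vm_gbps = A3_ULTRA_VM_GBPS / 2   # A3 Ultra doubles A3 Mega

print(per_gpu_gbps)       # 400.0 Gbps per GPU (under these assumptions)
print(a3_mega_vm_gbps)    # 1600.0 Gbps, i.e. 1.6 Tbps for A3 Mega
```

Under these assumptions, each GPU gets a non-blocking 400 Gbps share of the RoCE fabric, which is what allows tightly synchronized collectives to scale to the cluster sizes described above.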

With system blueprints, available through Cluster Toolkit, customers can quickly and easily create turnkey, pre-configured HPC clusters with Slurm support on A3 VMs.

A3 Ultra VMs will also be available through Google Kubernetes Engine (GKE), which provides an open, portable, extensible, and highly-scalable platform for large-scale training and serving of AI workloads.

Trillium: Ushering in a new era of TPU performance for AI

Tensor Processing Units, or TPUs, power our most advanced AI models such as Gemini, popular Google services like Search, Photos, and Maps, as well as scientific breakthroughs like AlphaFold 2 — which led to a Nobel Prize this year!

We recently announced that Trillium, our sixth-generation TPU, is available to Google Cloud customers in preview. 

Compared with TPU v5e, Trillium delivers: 

  • Over 4x improvement in training performance 

  • Up to 3x increase in inference throughput 

  • 67% increase in energy efficiency

  • 4.7x increase in peak compute performance per chip 

  • Double the high bandwidth memory capacity 

  • Double the interchip interconnect bandwidth 
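The improvement factors above can be applied to a notional TPU v5e baseline to see the per-metric deltas side by side. In this Python sketch the baseline value is a placeholder (1.0); only the multipliers come from the announcement.

```python
# Trillium-vs-v5e improvement factors quoted in the announcement,
# applied to a notional v5e baseline. The baseline value (1.0) is a
# placeholder; only the multipliers come from the post.

TRILLIUM_VS_V5E = {
    "training_performance": 4.0,    # "over 4x" improvement
    "inference_throughput": 3.0,    # "up to 3x" increase
    "energy_efficiency": 1.67,      # 67% increase
    "peak_compute_per_chip": 4.7,   # 4.7x peak compute per chip
    "hbm_capacity": 2.0,            # double high bandwidth memory
    "ici_bandwidth": 2.0,           # double interchip interconnect
}

v5e_baseline = 1.0
trillium = {metric: v5e_baseline * factor
            for metric, factor in TRILLIUM_VS_V5E.items()}

print(trillium["peak_compute_per_chip"])  # 4.7
```

Note that the training and inference figures are upper bounds ("over 4x", "up to 3x"), so real workloads will land somewhere below or around these multipliers depending on model and batch shape.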

Cluster Toolkit: Streamlining HPC deployments

We continue to improve Cluster Toolkit, which provides open-source tools for deploying and managing HPC environments on Google Cloud.

GKE: Container orchestration with scale and performance

GKE continues to lead the way for containerized workloads, supporting the largest Kubernetes clusters in the industry. With support for up to 65,000 nodes, we believe GKE offers more than 10x the scale of the next two largest public cloud providers.

At the same time, we continue to invest in automating and simplifying the building of HPC and AI platforms.

Customer success stories: Atommap and beyond

Atommap, a company specializing in atomic-scale materials design, is using Google Cloud HPC to accelerate its research and development efforts. With H3 VMs and Parallelstore, Atommap has achieved:  

  • Significant speedup in simulations: Reduced time-to-results by more than half, enabling faster innovation 

  • Improved scalability: Easily scaled resources to run thousands to tens of thousands of molecular simulations, meeting growing computational demands 

  • Better cost-effectiveness: Optimized infrastructure costs, with savings of up to 80%, while achieving high performance 

Atommap’s success story highlights the transformative potential of Google Cloud HPC for organizations pushing the boundaries of scientific discovery and technological advancement.

Looking ahead

Google Cloud is committed to continuous innovation for HPC. Expect further enhancements to HPC VMs, Parallelstore, Cluster Toolkit, Slurm-gcp, and other HPC products and solutions. With a focus on performance, scalability, compatibility, and ease of use, we’re empowering researchers and engineers to tackle the world’s most complex computational challenges.

Google Cloud Advanced Computing Community

We’re excited to announce the launch of the Google Cloud Advanced Computing Community, a new kind of community of practice for sharing and growing HPC, AI, and quantum computing expertise, innovation, and impact.

This community of practice will bring together thought leaders and experts from Google, its partners, and HPC, AI, and quantum computing organizations around the world for engaging presentations and panels on innovative technologies and their applications. The Community will also leverage Google’s powerful, comprehensive, and cloud-native tools to create an interactive, dynamic, and engaging forum for discussion and collaboration.

The Community launches now, with meetings starting in December 2024 and a full rollout of learning and collaboration resources in early 2025. To learn more, register here.

Google Cloud at Supercomputing 2024

The annual Supercomputing Conference series brings together the global HPC community to showcase the latest advancements in HPC, networking, storage and data analysis. Google Cloud is excited to return to Supercomputing 2024 in Atlanta with our largest presence ever. 

Visit Google Cloud at booth #1730 to jump in and learn about our HPC, AI infrastructure, and quantum solutions. The booth will feature a Trillium TPU board, NVIDIA H200 GPU and ConnectX-7 NIC, hands-on labs, a full schedule of talks, a comfortable lounge space, and plenty of great swag!

The booth theater will include talks from ARM, Altair, Ansys, Intel, NAG, SchedMD, Siemens, Sycomp, Weka, and more. Booth labs will have you deploying Slurm clusters to fine-tune the Llama2 model or run GROMACS, using Cloud Batch to run microbenchmarks or quantum simulations, and more.

We’re also involved in several parts of SC24’s technical program, including BoFs, user groups, and workshops, with Googlers participating in a number of technical sessions. 

Google is also hosting or sponsoring several events during SC24. We’re looking forward to seeing you there!

Finally, we’ll be holding private meetings and roadmap briefings with our HPC leadership throughout the conference. To schedule a meeting, please contact hpc-sales@google.com.
