
Cloud Storage FUSE is now optimized for GKE and AI workloads

Google Cloud’s Cloud Storage is home to reams of training data, models and checkpoints that you need to train and serve AI workloads, delivering the scale, performance, simplicity and cost-effectiveness that are the hallmarks of a cloud storage system. But when it comes time for an AI workload to actually access that data, it isn’t always straightforward, since most AI workloads require file system semantics, rather than the object semantics that Cloud Storage provides.

Linux’s Filesystem in Userspace, or FUSE, is an interface used to export a file system to the Linux kernel. An open-source version of Cloud Storage FUSE has been available for some time, allowing objects in Cloud Storage buckets to be accessed as files mounted as a local file system. Today we are taking an important next step: delivering Cloud Storage FUSE as a first-party Google Cloud offering, with new levels of portability, reliability, performance and integration.

The new Cloud Storage FUSE is particularly important for AI workloads. Because applications can access data directly (rather than downloading it locally), there’s no custom logic to implement, and less idle time for valuable resources like TPUs and GPUs while the data is copied over. Further, a new Cloud Storage FUSE CSI driver for Google Kubernetes Engine (GKE) allows applications to mount Cloud Storage using the familiar Kubernetes API, and it’s offered as a turnkey deployment managed by GKE.

Let’s take a closer look at how the new first-party Cloud Storage FUSE delivers increased portability, reliability, performance, and integration.

Portability

Cloud Storage is a common choice for AI/ML workloads because of its unlimited scale, simplicity, affordability, and performance. But while some AI/ML frameworks have libraries that support native object-storage APIs directly, others require file system semantics; or sometimes, the organization has standardized on file system semantics for a consistent experience across hybrid and multicloud environments. To overcome this, developers have to instrument their training code with logic to first copy the training data from Cloud Storage to a local disk. 

With Cloud Storage FUSE, objects in Cloud Storage buckets can be accessed as files mounted as a local file system, providing file system semantics while being able to continue using Cloud Storage. 
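For example, a bucket can be mounted on a VM with the gcsfuse CLI; the bucket and mount-point names below are placeholders, and the commands assume gcsfuse is installed and credentials are configured:

```shell
# Mount the bucket "my-training-data" (hypothetical name) at /mnt/gcs.
# --implicit-dirs makes objects under common prefixes appear as directories.
mkdir -p /mnt/gcs
gcsfuse --implicit-dirs my-training-data /mnt/gcs

# Objects can now be read with ordinary file-system tools and APIs.
ls /mnt/gcs

# Unmount when finished.
fusermount -u /mnt/gcs
```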

AnyConcept develops a new form of functional, no-code software testing using deep reinforcement learning agents, and uses Cloud Storage FUSE in the dataset pipeline for its AI models: first to pre-process data in Jupyter Notebooks, and then to train its AI models.

“Since we need file system semantics and the data is too large to be copied to our TPU based training VMs, Cloud Storage FUSE allows us to access this data directly, giving us unlimited space with the convenience of a file system while maintaining cost-efficiency.” – Manuel Weichselbaum, CTO, AnyConcept GmbH

Reliability

By making Cloud Storage FUSE a Google-supported product, we aimed to achieve Google standards for reliability, so you can run production workloads with full support. During testing, we uncovered and fixed several stability issues with the original code base. We also integrated Cloud Storage FUSE with the official Go Cloud Storage client library, and validated Cloud Storage FUSE for PyTorch and TensorFlow at high scale and long duration using ViT DINO and ResNet ML models. As part of our production readiness, we also overhauled the documentation to make it easier to use. 

Global credit reporting firm Equifax hosts its Equifax Ignite® platform on Google Cloud, which customers use to apply high-end machine learning capabilities on Equifax data for predictive models and insights. 

“Integration of Jupyter Notebook, installed on Google Kubernetes Engine (GKE), with Google’s Cloud Storage is a core component of Equifax Ignite, and the Cloud Storage FUSE integration with GKE through the CSI driver made it seamless and easy to use. We are pleased that it is now available as part of the Equifax Ignite fully cloud native service offering.” – Vibhu Prakash, Vice President, Analytics Platform, Equifax

Performance

AI/ML workloads typically use accelerators, in the form of GPUs and TPUs, for training and inference workflows. These accelerators are data-hungry, and keeping them idle while they wait for I/O only increases the cost of using them. For applications that need to consume Cloud Storage data via a file system, developers typically implement complex logic to first copy data from Cloud Storage to a local disk, resulting in idle time for the compute resources as they wait for objects to download. With Cloud Storage FUSE, developers can treat a Cloud Storage bucket as a local file system, and stream data directly to the application as if it were local. You can see published benchmarks here.

OpenX is a global adtech company that runs its exchange in Google Cloud, where it processes hundreds of billions of ad requests daily. It previously relied on a home-grown solution to fetch data files from Cloud Storage into init containers at pod start-up, and to periodically refresh the data in running pods. 

“With the Cloud Storage Fuse GKE integration, all of that goes away; all it takes is a simple annotation in the pod spec and a volume definition to make the data available. Using Cloud Storage FUSE with the GKE CSI driver has resulted not only in vastly simplified configuration for our applications, but has also reduced the pod startup time by up to 40%.” – Mark Chodos, Staff Engineer, OpenX

Integration

You can deploy Cloud Storage FUSE in your own environment in a variety of ways.

Pathology researcher Reveal Biosciences pinpoints and categorizes diseases and leverages machine learning to refine its models for superior accuracy, resulting in improved patient prognosis. 

“A pivotal asset in our journey has been Cloud Storage FUSE. Our data is stored in a Cloud Storage bucket, but because our application needs to access these files using file-system semantics, we used to have to download the data locally first. This remarkable tool now enables us to process terabytes of data without needing to manage locally attached storage to VMs or Kubernetes clusters, giving us an efficient alternative with scalable capacity that is not tied to local compute. Google has been instrumental in propelling us towards our performance goals, providing invaluable support.” – Bharat Jangir, MLops Engineer, Reveal Biosciences 

Using the Cloud Storage FUSE CSI driver on GKE

Previous FUSE solutions with Kubernetes required elevated privileges and suffered from noisy-neighbor issues and authentication challenges. The new Cloud Storage FUSE CSI driver does not need privileged access, is fully managed by the CSI lifecycle, and has built-in authentication with Workload Identity, all while allowing Kubernetes pods to access data in Cloud Storage buckets using file-system semantics.
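As a sketch of the Workload Identity wiring, the commands below bind a Kubernetes service account to a Google service account that can read the bucket. The project, namespace, and service-account names are placeholders, not values from this article:

```shell
# Allow the Kubernetes service account to impersonate a Google service account.
gcloud iam service-accounts add-iam-policy-binding \
    gcs-reader@my-project.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-k8s-sa]"

# Annotate the Kubernetes service account with the Google service account,
# so pods using it authenticate to Cloud Storage without key files.
kubectl annotate serviceaccount my-k8s-sa \
    --namespace my-namespace \
    iam.gke.io/gcp-service-account=gcs-reader@my-project.iam.gserviceaccount.com
```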

“The integration with GKE through the sidecar container injection makes the entire onboarding process easy to manage. It is very easy to authenticate, and the service account access makes it possible to only allow required pods to access required Cloud Storage objects/folders. Therefore, we are able to manage the access control easily via IAM.” – Uğur Arpaci, Lead DevOps Engineer, Codeway

There are two ways to provision Cloud Storage-backed volumes:  

  • Using ephemeral volumes, where you simply specify your Cloud Storage bucket and authentication information in the pod spec. We recommend this approach for its simplicity.

  • Using static provisioning with PersistentVolumes and PersistentVolumeClaims. This approach is recommended if compatibility with the traditional ways of accessing storage on Kubernetes is important for your organization.
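For the static-provisioning path, the wiring looks roughly like the sketch below; the bucket, claim, and storage-class names are placeholders, and the driver documentation covers the full set of volume attributes:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gcs-fuse-csi-pv
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 5Gi
  storageClassName: example-storage-class
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: my-bucket-name   # the Cloud Storage bucket to mount
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-fuse-csi-static-pvc
  namespace: my_namespace
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: example-storage-class
  volumeName: gcs-fuse-csi-pv   # bind directly to the PersistentVolume above
```

A pod then references the claim in an ordinary `persistentVolumeClaim` volume, just as it would for any other storage class.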

The Cloud Storage FUSE CSI driver is supported on both GKE Standard and GKE Autopilot starting with GKE version 1.26, with a plan to backport to earlier versions in subsequent releases.

Terraform templates are also available. If you are using ephemeral volumes, you simply specify your Cloud Storage bucket and authentication information in the pod spec, and the data will be available to your pod. Learn more here. An example pod spec is below.

apiVersion: v1
kind: Pod
metadata:
  name: gcs-fuse-csi-example-ephemeral
  namespace: my_namespace
  annotations:
    gke-gcsfuse/volumes: "true"
spec:
  terminationGracePeriodSeconds: 60
  securityContext:
    runAsUser: 1001
    runAsGroup: 2002
    fsGroup: 3003
  containers:
  - image: busybox
    name: busybox
    command: ["sleep"]
    args: ["infinity"]
    volumeMounts:
    - name: gcs-fuse-csi-ephemeral
      mountPath: /data
      readOnly: true
  serviceAccountName: my_k8s_sa
  volumes:
  - name: gcs-fuse-csi-ephemeral
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-bucket-name
        mountOptions: "implicit-dirs,uid=1001,gid=3003"

File system semantics, without the fuss

With Cloud Storage FUSE, you can continue to use Cloud Storage as your source of truth for your AI/ML workloads, without sacrificing file-system semantics, wasting valuable resources, or having to implement complex integration logic. To learn more, read about how Codeway leverages Cloud Storage FUSE for generative AI, check out the official documentation, or see the Cloud Storage FUSE GitHub page.
