Generative AI diffusion models such as Stable Diffusion and Flux produce stunning visuals, empowering creators across various verticals with impressive image generation capabilities. However, generating high-quality images through sophisticated pipelines can be computationally demanding, even with powerful hardware like GPUs and TPUs, impacting both costs and time-to-result.
The key challenge lies in optimizing the entire pipeline to minimize cost and latency without compromising on image quality. This delicate balance is crucial for unlocking the full potential of image generation in real-world applications. For example, before reducing the model size to cut image generation costs, prioritize optimizing the underlying infrastructure and software to ensure peak model performance.
At Google Cloud Consulting, we’ve been assisting customers in navigating these complexities. We understand the importance of optimized image generation pipelines, and in this post, we’ll share three proven strategies to help you achieve both efficiency and cost-effectiveness, to deliver exceptional user experiences.
We recommend having a comprehensive optimization strategy that addresses all aspects of the pipeline, from hardware to code to overall architecture. One way that we address this at Google Cloud is with AI Hypercomputer, a composable supercomputing architecture that brings together hardware like TPUs and GPUs along with software and frameworks like PyTorch. Here’s a breakdown of the key areas we focus on:
Image generation pipelines often require GPUs or TPUs for deployment, and optimizing hardware utilization can significantly reduce costs. Since GPUs cannot be allocated fractionally, underutilization is common, especially when scaling workloads, leading to inefficiency and increased cost of operation. To address this, Google Kubernetes Engine (GKE) offers several GPU sharing strategies to improve resource efficiency. Additionally, A3 High VMs with NVIDIA H100 80GB GPUs come in smaller sizes, helping you scale efficiently and control costs.
Some key GPU sharing strategies in GKE include:
- Multi-instance GPUs (MIG), which partition a single supported GPU (such as an A100 or H100) into up to seven smaller, fully isolated instances, each with its own memory and compute resources.
- GPU time-sharing, which lets multiple containers share one physical GPU by rapidly context-switching between their workloads.
- NVIDIA MPS (Multi-Process Service), which allows multiple workloads to run concurrently on the same GPU.
Example illustration of GPU sharing strategies
When you have an existing pipeline written natively in PyTorch, you have several options to optimize and reduce the pipeline execution time.
One way is to use PyTorch’s torch.compile method, which just-in-time (JIT) compiles PyTorch code into optimized kernels for faster execution, especially for the forward pass of the decoder step. You can do this through various compiler backends such as NVIDIA TensorRT, OpenVINO, or IPEX, depending on the underlying hardware. You can also use certain compiler backends at training time. A similar JIT compilation capability is also available in other frameworks such as JAX.
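As a rough illustration, here is a minimal sketch of applying torch.compile to a Hugging Face Diffusers Stable Diffusion pipeline. The model name and the choice of which modules to compile are illustrative assumptions, not a prescribed setup, and the snippet assumes a CUDA GPU with PyTorch 2.x installed:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an example Stable Diffusion pipeline (model name is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# JIT-compile the most compute-heavy modules. The default "inductor"
# backend is used here; TensorRT, OpenVINO, or IPEX backends can be
# selected instead, depending on the underlying hardware.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)

# The first call triggers compilation; subsequent calls reuse the optimized kernels.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
```

Because compilation happens on the first call, it usually pays off for serving workloads where the same pipeline handles many requests rather than a single one-off generation.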
Another way to improve code latency is to enable Flash Attention. By calling torch.backends.cuda.enable_flash_sdp(True), PyTorch’s scaled dot-product attention natively dispatches to Flash Attention whenever it speeds up the computation, and automatically falls back to another attention implementation when Flash Attention isn’t optimal for the given inputs.
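For example, assuming PyTorch 2.x and a GPU that supports Flash Attention, a minimal sketch looks like this (the tensor shapes are arbitrary placeholders):

```python
import torch
import torch.nn.functional as F

# Allow scaled dot-product attention to dispatch to the Flash Attention
# kernel when the inputs qualify (e.g., fp16/bf16 on a supported GPU);
# other attention backends remain available as automatic fallbacks.
torch.backends.cuda.enable_flash_sdp(True)

q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# PyTorch picks the fastest eligible attention implementation here.
out = F.scaled_dot_product_attention(q, k, v)
```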
Additionally, to reduce latency, you also need to minimize data transfers between the GPU and CPU. Operations such as loading tensors from the host, or reading a tensor’s value back in Python (for example, to compare it against a Python float in control flow), incur significant overhead: each such read forces a GPU-to-CPU transfer and a synchronization. Ideally, you should load a tensor onto the GPU and copy it back off only once in the entire image generation pipeline; this is especially important for pipelines that chain several models, where the latency of each transfer cascades with every model that runs. Tools such as PyTorch Profiler help you observe where a model spends its time and memory.
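Here is a minimal sketch of using PyTorch Profiler to surface such transfers, keeping a comparison on the device rather than pulling values back to the host. The tensors and the threshold are placeholders, and a CUDA device is assumed:

```python
import torch
from torch.profiler import profile, ProfilerActivity

latents = torch.randn(1, 4, 64, 64, device="cuda")
threshold = torch.tensor(0.5, device="cuda")  # keep the scalar on the GPU

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    # Device-side comparison and masking: no GPU->CPU round trip.
    mask = latents.abs() > threshold
    clipped = torch.where(mask, latents.clamp(-0.5, 0.5), latents)
    # By contrast, something like `if latents.abs().max().item() > 0.5:`
    # would force a GPU->CPU copy and a synchronization on every step.

# Unintended transfers show up as memcpy / aten::item entries in the table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```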
While optimizing code can help speed up individual modules within a pipeline, you need to take a look at the big picture. Many multi-step image-generation pipelines cascade multiple models (e.g., samplers, decoders, and image and/or text embedding models) one after the other to generate the final image, often on a single container that has a single GPU attached.
For diffusion-based pipelines, models such as decoders can have significantly higher computational complexity and therefore take longer to execute, especially compared with embedding models, which are generally faster. That means certain models can become a bottleneck in the generation pipeline. To optimize GPU utilization and mitigate this bottleneck, consider a multi-threaded, queue-based approach to task scheduling and execution. This approach enables parallel execution of different pipeline stages on the same GPU, allowing several requests to be processed concurrently. Efficiently distributing tasks among worker threads minimizes GPU idle time and maximizes resource utilization, ultimately leading to higher throughput.
Furthermore, by maintaining tensors on the same GPU throughout the process, you can reduce the overhead of CPU-to-GPU (and vice-versa) data transfers, further enhancing efficiency and reducing costs.
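Below is a minimal, illustrative sketch of such a queue-based, multi-threaded pipeline. The three stage functions are hypothetical placeholders standing in for the real embedding, denoising, and decoding models; intermediate tensors stay on the GPU until the single copy back to the host at the end, and a CUDA device is assumed:

```python
import queue
import threading
import torch

# Placeholder stage callables; in a real pipeline these would wrap the
# text encoder, the diffusion/denoising loop, and the VAE decoder.
def embed(prompt):      return torch.randn(1, 77, 768, device="cuda")
def denoise(embedding): return torch.randn(1, 4, 64, 64, device="cuda")
def decode(latents):    return torch.randn(1, 3, 512, 512, device="cuda")

stages = [embed, denoise, decode]
queues = [queue.Queue() for _ in range(len(stages) + 1)]

def worker(stage_fn, in_q, out_q):
    # Each stage runs in its own thread: while the (slow) decoder works on
    # one request, the (fast) embedder can already process the next one,
    # keeping the single GPU busy. Intermediate tensors never leave the GPU.
    while True:
        item = in_q.get()
        if item is None:          # shutdown signal, passed downstream
            out_q.put(None)
            break
        request_id, data = item
        with torch.inference_mode():
            out_q.put((request_id, stage_fn(data)))

threads = [
    threading.Thread(target=worker, args=(fn, queues[i], queues[i + 1]), daemon=True)
    for i, fn in enumerate(stages)
]
for t in threads:
    t.start()

# Two concurrent requests flow through the pipeline stages in parallel.
queues[0].put((0, "a castle in the clouds"))
queues[0].put((1, "a robot reading a book"))
queues[0].put(None)

results = {}
while True:
    item = queues[-1].get()
    if item is None:
        break
    request_id, image_tensor = item
    results[request_id] = image_tensor.cpu()  # single GPU->CPU copy at the very end

print(f"generated {len(results)} images")
```

In practice the queues also give you a natural place to apply back-pressure and batching, so a slow stage limits how much work piles up ahead of it rather than exhausting GPU memory.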
Processing time comparison between a stacked pipeline and a multithreaded pipeline for 2 concurrent requests
Optimizing image-generation pipelines is a multifaceted process, but the rewards are significant. By adopting a comprehensive approach that includes hardware optimization for maximized resource utilization, code optimization for faster execution, and pipeline optimization for increased throughput, you can achieve substantial performance gains, reduce costs, and deliver exceptional user experiences. In our work with customers, we’ve consistently seen that implementing these optimization strategies can lead to significant cost savings without compromising image quality.
Ready to get started?
At Google Cloud Consulting, we’re dedicated to helping our customers build and deploy high-performing AI solutions. If you’re looking to optimize your image generation pipelines, connect with Google Cloud Consulting today, and we’ll work to help you unlock the full potential of your AI initiatives.
We extend our sincere gratitude to Akhil Sakarwal, Ashish Tendulkar, Abhijat Gupta, and Suraj Kanojia for their invaluable support and guidance throughout the crucial experimentation phase.