
Use Gemini CLI to deploy cost-effective LLM workloads on GKE

Deploying LLM workloads can be complex and costly, often involving a lengthy, multi-step process. To solve this, Google Kubernetes Engine (GKE) offers Inference Quickstart.

With Inference Quickstart, you can replace months of manual trial-and-error with out-of-the-box manifests and data-driven insights. Inference Quickstart integrates with the Gemini CLI through native Model Context Protocol (MCP) support to offer tailored recommendations for your LLM workload cost and performance needs. Together, these tools empower you to analyze, select, and deploy your LLMs on GKE in a matter of minutes. Here’s how. 

1. Select and serve your LLM on GKE via Gemini CLI

You can install the Gemini CLI and the gke-mcp server with the following steps:

# install Gemini CLI (additional instructions)
brew install gemini-cli

# install gke-mcp as a Gemini CLI extension
gemini extensions install https://github.com/GoogleCloudPlatform/gke-mcp.git

Here are some example prompts that you can give Gemini CLI to select an LLM workload and generate the manifest needed to deploy the model to a GKE cluster:

1. What are the 3 cheapest models available on GKE Inference Quickstart? Can you provide all of the related performance data and accelerators they ran on?
2. How does this model's performance compare when it was run on different accelerators?
3. How do I choose between these 2 models?
4. I'd like to generate a manifest for this model on this accelerator and save it to the current directory.

The video below shows an end-to-end example of how you can quickly identify and deploy an optimal LLM workload to a pre-existing GKE cluster with this Gemini CLI setup:
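If you prefer text to video, the last mile looks roughly like the sketch below. It assumes an existing GKE cluster and that Gemini CLI saved the generated manifest to the current directory; the cluster name, location, and filename are placeholders for whatever your own session produces.

# Point kubectl at your existing cluster (cluster name and location are placeholders)
gcloud container clusters get-credentials my-inference-cluster --location=us-central1

# Apply the manifest that Gemini CLI generated (filename will vary)
kubectl apply -f gemma-3-4b-vllm-manifest.yaml

# Watch the inference server start up
kubectl get pods -w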

2. Save money while maintaining performance

Choosing the right hardware for your inference workload means balancing performance and cost. The trade-off is nonlinear. To simplify this complex trade-off, Inference Quickstart provides performance and cost insights across various accelerators, all backed by Google’s benchmarks.

For example, as shown in the graph below, minimizing latency for a model like Gemma 3 4b on vLLM dramatically increases cost. This is because achieving ultra-low latency requires sacrificing the efficiency of request batching, which leaves your accelerators underutilized. Request load, model size, architecture, and workload characteristics can all impact which accelerator is optimal for your specific use case.

To make an informed decision, you can get instant, data-driven recommendations by asking Gemini CLI or using the Inference Quickstart Colab notebook.
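For example, you might ask for the trade-off data directly. The sketch below assumes Gemini CLI's non-interactive -p/--prompt flag; the prompt wording, model, and latency goal are illustrative rather than taken from the Quickstart documentation.

# Illustrative one-shot prompt; adjust the model, accelerators, and latency target to your needs
gemini -p "Using GKE Inference Quickstart benchmark data, compare the cost per million output tokens for Gemma 3 4B on vLLM across accelerators, and recommend the cheapest option that still meets my latency target."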

3. Calculate cost per input/output token

When you host your own model on a platform like GKE, you are billed for accelerator time, not for each individual token. Inference Quickstart calculates cost per token using the accelerator’s hourly cost and the input/output throughput.

The following formula attributes the total accelerator cost to both input and output tokens:

$/output token = Accelerator $/s / (1/4 * input tokens/s + output tokens/s)

where

$/input token = ($/output token) / 4

This formula assumes an output token costs four times as much as an input token. The reason for this heuristic is that the prefill phase (processing input tokens) is a highly parallel operation, whereas the decode phase (generating output tokens) is a sequential, auto-regressive process. You can ask Gemini CLI to adjust this ratio to fit your workload's expected input/output mix.
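As a quick sanity check of the formula, here is a minimal sketch with made-up numbers: a $3.00/hour accelerator serving 5,000 input tokens/s and 1,000 output tokens/s. The real prices and throughputs come from Inference Quickstart's benchmarks; only the arithmetic below mirrors the formula above.

# Hypothetical inputs: accelerator price and measured throughput
ACCEL_USD_PER_HOUR=3.00
INPUT_TPS=5000
OUTPUT_TPS=1000

# $/s for the accelerator
ACCEL_USD_PER_SEC=$(echo "scale=10; $ACCEL_USD_PER_HOUR / 3600" | bc)

# $/output token = accelerator $/s / (1/4 * input tokens/s + output tokens/s)
USD_PER_OUTPUT_TOKEN=$(echo "scale=12; $ACCEL_USD_PER_SEC / ($INPUT_TPS / 4 + $OUTPUT_TPS)" | bc)

# $/input token = ($/output token) / 4
USD_PER_INPUT_TOKEN=$(echo "scale=12; $USD_PER_OUTPUT_TOKEN / 4" | bc)

echo "USD per output token: $USD_PER_OUTPUT_TOKEN"   # ~0.00000037
echo "USD per input token:  $USD_PER_INPUT_TOKEN"    # ~0.00000009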

The key to cost-effective LLM inference is to take a data-driven approach. By relying on benchmarks for your workloads and using metrics like cost per token, you can make informed decisions that directly impact your budget and performance.

Next steps

GKE Inference Quickstart goes beyond cost insights and Gemini CLI integration, offering optimizations for storage, autoscaling, and observability as well. Run your LLM workloads with GKE Inference Quickstart today to see how it can expedite and optimize your deployments on GKE.
