
Use Gemini CLI to deploy cost-effective LLM workloads on GKE

Deploying LLM workloads can be complex and costly, often involving a lengthy, multi-step process. To solve this, Google Kubernetes Engine (GKE) offers Inference Quickstart.

With Inference Quickstart, you can replace months of manual trial-and-error with out-of-the-box manifests and data-driven insights. Inference Quickstart integrates with the Gemini CLI through native Model Context Protocol (MCP) support to offer tailored recommendations for your LLM workload cost and performance needs. Together, these tools empower you to analyze, select, and deploy your LLMs on GKE in a matter of minutes. Here’s how. 

1. Select and serve your LLM on GKE via Gemini CLI

You can install the Gemini CLI and the gke-mcp server with the following steps:

```
# install Gemini CLI (additional instructions)
brew install gemini-cli

# install gke-mcp as a Gemini CLI extension
gemini extensions install https://github.com/GoogleCloudPlatform/gke-mcp.git
```

Here are some example prompts that you can give Gemini CLI to select an LLM workload and generate the manifest needed to deploy the model to a GKE cluster:

```
1. What are the 3 cheapest models available on GKE Inference Quickstart? Can you provide all of the related performance data and accelerators they ran on?
2. How does this model's performance compare when it was run on different accelerators?
3. How do I choose between these 2 models?
4. I'd like to generate a manifest for this model on this accelerator and save it to the current directory.
```

The video below shows an end-to-end example of how you can quickly identify and deploy your optimal LLM workload to a pre-existing GKE cluster via this Gemini CLI setup:

2. Save money while maintaining performance

Choosing the right hardware for your inference workload means balancing performance and cost. The trade-off is nonlinear. To simplify this complex trade-off, Inference Quickstart provides performance and cost insights across various accelerators, all backed by Google’s benchmarks.

For example, as shown in the graph below, minimizing latency for a model like Gemma 3 4b on vLLM dramatically increases cost. This is because achieving ultra-low latency requires sacrificing the efficiency of request batching, which leaves your accelerators underutilized. Request load, model size, architecture, and workload characteristics can all impact which accelerator is optimal for your specific use case.

To make an informed decision, you can get instant, data-driven recommendations by asking Gemini CLI or using the Inference Quickstart Colab notebook.
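To see how this kind of comparison works, here is a minimal sketch that ranks accelerators by cost per million output tokens. The accelerator names, prices, and throughputs are illustrative placeholders, not Google benchmarks; the real figures come from Inference Quickstart's benchmark data.

```python
# Hypothetical benchmark rows: (accelerator, $/hour, output tokens/s at a target latency).
# All numbers below are illustrative placeholders, not real benchmark results.
benchmarks = [
    ("accel-a", 1.20, 90.0),
    ("accel-b", 2.50, 240.0),
    ("accel-c", 5.00, 380.0),
]

def cost_per_million_output_tokens(dollars_per_hour: float, output_tps: float) -> float:
    """Cost attributed to one million output tokens on a fully utilized accelerator."""
    dollars_per_second = dollars_per_hour / 3600.0
    return dollars_per_second / output_tps * 1_000_000

# Rank accelerators from cheapest to most expensive per output token.
ranked = sorted(benchmarks, key=lambda row: cost_per_million_output_tokens(row[1], row[2]))
for name, price, tps in ranked:
    print(f"{name}: ${cost_per_million_output_tokens(price, tps):.4f} per 1M output tokens")
```

Note that the most expensive accelerator per hour is not necessarily the most expensive per token: higher throughput can more than offset a higher hourly price.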

3. Calculate cost per input/output token

When you host your own model on a platform like GKE, you are billed for accelerator time, not for each individual token. Inference Quickstart calculates cost per token using the accelerator’s hourly cost and the input/output throughput.

The following formula attributes the total accelerator cost to both input and output tokens:

```
$/output token = Accelerator $/s / (1/4 × input tokens/s + output tokens/s)

where

$/input token = ($/output token) / 4
```

This formula assumes an output token costs four times as much as an input token. The reason for this heuristic is that the prefill phase (processing input tokens) is a highly parallel operation, whereas the decode phase (generating output tokens) is a sequential, auto-regressive process. You can ask Gemini CLI to adjust this ratio to match your workload's expected input/output mix.
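The formula above can be sketched directly in code. This is a direct transcription of the cost-attribution formula, with the 4:1 output-to-input cost ratio exposed as a parameter; the example throughput and price values are hypothetical.

```python
def cost_per_token(
    accel_dollars_per_hour: float,
    input_tps: float,
    output_tps: float,
    ratio: float = 4.0,
) -> tuple[float, float]:
    """Attribute accelerator cost to tokens, assuming an output token costs
    `ratio` times an input token (default 4, per the heuristic above).
    Returns ($/input token, $/output token)."""
    dollars_per_second = accel_dollars_per_hour / 3600.0
    # $/output token = $/s / (input tps / ratio + output tps)
    out_cost = dollars_per_second / (input_tps / ratio + output_tps)
    in_cost = out_cost / ratio
    return in_cost, out_cost

# Hypothetical example: a $1.00/hr accelerator serving 400 input and
# 100 output tokens per second.
in_cost, out_cost = cost_per_token(1.00, input_tps=400.0, output_tps=100.0)
```

With these example numbers, the denominator is 400/4 + 100 = 200 tokens/s of output-equivalent work, so the output-token cost is (1/3600) / 200 dollars.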

The key to cost-effective LLM inference is to take a data-driven approach. By relying on benchmarks for your workloads and using metrics like cost per token, you can make informed decisions that directly impact your budget and performance.

Next steps

GKE Inference Quickstart goes beyond cost insights and Gemini CLI integration, with optimizations for storage, autoscaling, and observability. Try GKE Inference Quickstart today to see how it can expedite and optimize your LLM workloads on GKE.
