
Use Gemini CLI to deploy cost-effective LLM workloads on GKE

Deploying LLM workloads can be complex and costly, often involving a lengthy, multi-step process. To solve this, Google Kubernetes Engine (GKE) offers Inference Quickstart.

With Inference Quickstart, you can replace months of manual trial-and-error with out-of-the-box manifests and data-driven insights. Inference Quickstart integrates with the Gemini CLI through native Model Context Protocol (MCP) support to offer tailored recommendations for your LLM workload cost and performance needs. Together, these tools empower you to analyze, select, and deploy your LLMs on GKE in a matter of minutes. Here’s how. 

1. Select and serve your LLM on GKE via Gemini CLI

You can install the Gemini CLI and the gke-mcp server with the following steps:

# install Gemini CLI (additional instructions)
brew install gemini-cli

# install gke-mcp as a Gemini CLI extension
gemini extensions install https://github.com/GoogleCloudPlatform/gke-mcp.git

Here are some example prompts that you can give Gemini CLI to select an LLM workload and generate the manifest needed to deploy the model to a GKE cluster:

1. What are the 3 cheapest models available on GKE Inference Quickstart? Can you provide all of the related performance data and accelerators they ran on?
2. How does this model's performance compare when it was run on different accelerators?
3. How do I choose between these 2 models?
4. I'd like to generate a manifest for this model on this accelerator and save it to the current directory.

The video below shows an end-to-end example of how you can quickly identify and deploy an optimal LLM workload to a pre-existing GKE cluster with this Gemini CLI setup:
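If you prefer text to video, the last mile looks roughly like the sketch below. It assumes an existing GKE cluster and that Gemini CLI saved the generated manifest to the current directory; the cluster name, location, and filename are placeholders for whatever your own session produces.

# Point kubectl at your existing cluster (cluster name and location are placeholders)
gcloud container clusters get-credentials my-inference-cluster --location=us-central1

# Apply the manifest that Gemini CLI generated (filename will vary)
kubectl apply -f gemma-3-4b-vllm-manifest.yaml

# Watch the inference server start up
kubectl get pods -w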

2. Save money while maintaining performance

Choosing the right hardware for your inference workload means balancing performance and cost. The trade-off is nonlinear. To simplify this complex trade-off, Inference Quickstart provides performance and cost insights across various accelerators, all backed by Google’s benchmarks.

For example, as shown in the graph below, minimizing latency for a model like Gemma 3 4b on vLLM dramatically increases cost. This is because achieving ultra-low latency requires sacrificing the efficiency of request batching, which leaves your accelerators underutilized. Request load, model size, architecture, and workload characteristics can all impact which accelerator is optimal for your specific use case.

To make an informed decision, you can get instant, data-driven recommendations by asking Gemini CLI or using the Inference Quickstart Colab notebook.
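For example, you might ask for the trade-off data directly. The sketch below assumes Gemini CLI's non-interactive -p/--prompt flag; the prompt wording, model, and latency goal are illustrative rather than taken from the Quickstart documentation.

# Illustrative one-shot prompt; adjust the model, accelerators, and latency target to your needs
gemini -p "Using GKE Inference Quickstart benchmark data, compare the cost per million output tokens for Gemma 3 4B on vLLM across accelerators, and recommend the cheapest option that still meets my latency target."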

3. Calculate cost per input/output token

When you host your own model on a platform like GKE, you are billed for accelerator time, not for each individual token. Inference Quickstart calculates cost per token using the accelerator’s hourly cost and the input/output throughput.

The following formula attributes the total accelerator cost to both input and output tokens:

$/output token = Accelerator $/s / (1/4 * input tokens/s + output tokens/s)

where

$/input token = ($/output token) / 4

This formula assumes an output token costs four times as much as an input token. The reason for this heuristic is that the prefill phase (processing input tokens) is a highly parallel operation, whereas the decode phase (generating output tokens) is a sequential, auto-regressive process. You can ask Gemini CLI to adjust this ratio to fit your workload's expected input/output mix.
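As a quick sanity check of the formula, here is a minimal sketch with made-up numbers: a $3.00/hour accelerator serving 5,000 input tokens/s and 1,000 output tokens/s. The real prices and throughputs come from Inference Quickstart's benchmarks; only the arithmetic below mirrors the formula above.

# Hypothetical inputs: accelerator price and measured throughput
ACCEL_USD_PER_HOUR=3.00
INPUT_TPS=5000
OUTPUT_TPS=1000

# $/s for the accelerator
ACCEL_USD_PER_SEC=$(echo "scale=10; $ACCEL_USD_PER_HOUR / 3600" | bc)

# $/output token = accelerator $/s / (1/4 * input tokens/s + output tokens/s)
USD_PER_OUTPUT_TOKEN=$(echo "scale=12; $ACCEL_USD_PER_SEC / ($INPUT_TPS / 4 + $OUTPUT_TPS)" | bc)

# $/input token = ($/output token) / 4
USD_PER_INPUT_TOKEN=$(echo "scale=12; $USD_PER_OUTPUT_TOKEN / 4" | bc)

echo "USD per output token: $USD_PER_OUTPUT_TOKEN"   # ~0.00000037
echo "USD per input token:  $USD_PER_INPUT_TOKEN"    # ~0.00000009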

The key to cost-effective LLM inference is to take a data-driven approach. By relying on benchmarks for your workloads and using metrics like cost per token, you can make informed decisions that directly impact your budget and performance.

Next steps

GKE Inference Quickstart goes beyond cost insights and Gemini CLI integration, offering optimizations for storage, autoscaling, and observability as well. Run your LLM workloads with GKE Inference Quickstart today to see how it can expedite and optimize your deployments on GKE.
