
Use Gemini CLI to deploy cost-effective LLM workloads on GKE

Deploying LLM workloads can be complex and costly, often involving a lengthy, multi-step process. To solve this, Google Kubernetes Engine (GKE) offers Inference Quickstart.

With Inference Quickstart, you can replace months of manual trial-and-error with out-of-the-box manifests and data-driven insights. Inference Quickstart integrates with the Gemini CLI through native Model Context Protocol (MCP) support to offer tailored recommendations for your LLM workload cost and performance needs. Together, these tools empower you to analyze, select, and deploy your LLMs on GKE in a matter of minutes. Here’s how. 

1. Select and serve your LLM on GKE via Gemini CLI

You can install the Gemini CLI and the gke-mcp server with the following commands:

```
# install Gemini CLI (additional instructions)
brew install gemini-cli

# install gke-mcp as Gemini CLI extension
gemini extensions install https://github.com/GoogleCloudPlatform/gke-mcp.git
```

Here are some example prompts that you can give Gemini CLI to select an LLM workload and generate the manifest needed to deploy the model to a GKE cluster:

```
1. What are the 3 cheapest models available on GKE Inference Quickstart? Can you provide all of the related performance data and accelerators they ran on?
2. How does this model's performance compare when it was run on different accelerators?
3. How do I choose between these 2 models?
4. I'd like to generate a manifest for this model on this accelerator and save it to the current directory.
```
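Once Gemini CLI has written the manifest, the deploy step itself is standard kubectl. A minimal sketch, assuming the generated manifest was saved to the current directory as `gemma-deploy.yaml` (a hypothetical filename) and that `kubectl` is already pointed at your GKE cluster:

```shell
# Apply the manifest that Gemini CLI generated (hypothetical filename).
kubectl apply -f gemma-deploy.yaml

# Watch the inference server pods come up.
kubectl get pods -w
```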

The video below shows an end-to-end example of how you can quickly identify and deploy your optimal LLM workload to a pre-existing GKE cluster via this Gemini CLI setup:

2. Save money while maintaining performance

Choosing the right hardware for your inference workload means balancing performance and cost. The trade-off is nonlinear. To simplify this complex trade-off, Inference Quickstart provides performance and cost insights across various accelerators, all backed by Google’s benchmarks.

For example, as shown in the graph below, minimizing latency for a model like Gemma 3 4b on vLLM dramatically increases cost. This is because achieving ultra-low latency requires sacrificing the efficiency of request batching, which leaves your accelerators underutilized. Request load, model size, architecture, and workload characteristics can all impact which accelerator is optimal for your specific use case.
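To see why underutilized accelerators drive up cost, consider a toy model (illustrative numbers only, not Inference Quickstart benchmark data): the accelerator's hourly price is fixed, so serving fewer concurrent requests spreads that price over fewer tokens.

```python
# Illustrative only: a simplified model of the batching/cost trade-off.
ACCEL_COST_PER_HOUR = 2.00        # hypothetical accelerator price, $/hr
TOKENS_PER_S_PER_REQUEST = 50     # hypothetical decode speed per request

def cost_per_million_output_tokens(batch_size: int) -> float:
    """Cost per 1M output tokens when `batch_size` requests decode in parallel."""
    accelerator_throughput = batch_size * TOKENS_PER_S_PER_REQUEST  # tokens/s
    cost_per_second = ACCEL_COST_PER_HOUR / 3600
    return cost_per_second / accelerator_throughput * 1_000_000

for batch in (1, 8, 64):
    print(batch, round(cost_per_million_output_tokens(batch), 4))
```

In this idealized model, halving the batch size exactly doubles the cost per token. In practice, per-request decode speed also degrades as batch size grows, which is why the real trade-off is nonlinear.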

To make an informed decision, you can get instant, data-driven recommendations by asking Gemini CLI or using the Inference Quickstart Colab notebook.

3. Calculate cost per input/output token

When you host your own model on a platform like GKE, you are billed for accelerator time, not for each individual token. Inference Quickstart calculates cost per token using the accelerator’s hourly cost and the input/output throughput.

The following formula attributes the total accelerator cost to both input and output tokens:

```
$/output token = accelerator $/s / (1/4 * input tokens/s + output tokens/s)

where

$/input token = ($/output token) / 4
```

This formula assumes an output token costs four times as much as an input token. The reason for this heuristic is that the prefill phase (processing input tokens) is a highly parallel operation, whereas the decode phase (generating output tokens) is a sequential, auto-regressive process. You can ask Gemini CLI to change this ratio to match your workload's expected input/output mix.
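The formula above translates directly into code. A minimal sketch, with hypothetical throughput and price numbers (the function name and parameters are this example's own, not an Inference Quickstart API):

```python
def token_costs(accel_cost_per_s: float,
                input_tokens_per_s: float,
                output_tokens_per_s: float,
                output_to_input_ratio: float = 4.0) -> tuple[float, float]:
    """Return ($/input token, $/output token), attributing the full
    accelerator cost to the weighted token throughput."""
    # Each input token counts as 1/ratio of an output token.
    weighted_throughput = (input_tokens_per_s / output_to_input_ratio
                           + output_tokens_per_s)
    cost_per_output = accel_cost_per_s / weighted_throughput
    cost_per_input = cost_per_output / output_to_input_ratio
    return cost_per_input, cost_per_output

# Hypothetical workload: $0.001/s accelerator, 1000 input and 250 output tokens/s.
inp, out = token_costs(0.001, 1000, 250)
print(inp, out)  # 5e-07 2e-06
```

Changing `output_to_input_ratio` is the code-level equivalent of asking Gemini CLI to adjust the ratio for your workload.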

The key to cost-effective LLM inference is to take a data-driven approach. By relying on benchmarks for your workloads and using metrics like cost per token, you can make informed decisions that directly impact your budget and performance.

Next steps

GKE Inference Quickstart goes beyond cost insights and Gemini CLI integration, with optimizations for storage, autoscaling, and observability. Run your LLM workloads today with GKE Inference Quickstart to see how it can expedite and optimize your LLM deployments on GKE.
