Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

With Vertex AI Model Garden, Google Cloud strives to deliver highly efficient and cost-optimized ML workflow recipes. It currently offers a selection of more than 150 first-party, open, and third-party foundation models. Last year, we introduced the popular open-source LLM serving stack vLLM on GPUs in Vertex AI Model Garden, and since then we have seen rapid growth in serving deployments. Today, we are thrilled to introduce Hex-LLM, High-Efficiency LLM Serving with XLA, on TPUs in Vertex AI Model Garden.

Hex-LLM is Vertex AI’s in-house LLM serving framework, designed and optimized for Google’s Cloud TPU hardware, which is available as part of AI Hypercomputer. Hex-LLM combines state-of-the-art LLM serving technologies, including continuous batching and paged attention, with in-house optimizations tailored for XLA/TPU, making it our latest high-efficiency, low-cost LLM serving solution on TPU for open-source models. Hex-LLM is now available in Vertex AI Model Garden via playground, notebook, and one-click deployment. We can’t wait to see how Hex-LLM and Cloud TPUs can help your LLM serving workflows.

Design and benchmarks

Hex-LLM is inspired by a number of successful open-source projects, including vLLM and FlashAttention, and incorporates the latest LLM serving technologies and in-house optimizations that are tailored for XLA/TPU.

The key optimizations in Hex-LLM include:

  • A token-based continuous batching algorithm that maximizes memory utilization for the KV cache (see the sketch after this list).

  • A complete rewrite of the PagedAttention kernel that is optimized for XLA/TPU.

  • Flexible and composable data parallelism and tensor parallelism strategies, with specialized weight-sharding optimizations to run large models efficiently across multiple TPU chips.
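
To make the token-based continuous batching idea concrete, here is a minimal, hypothetical scheduler sketch in Python. It is not Hex-LLM code: the class names, the single global token budget, and the per-sequence decode loop are simplifications (in practice the whole batch runs as one XLA program on TPU, prefill is handled separately, and the KV cache is paged).

from collections import deque
from dataclasses import dataclass, field


@dataclass
class Sequence:
    prompt_tokens: list          # already-tokenized prompt (prefill is ignored here)
    max_new_tokens: int
    generated: list = field(default_factory=list)

    def num_tokens(self) -> int:
        # Tokens that currently occupy KV-cache space.
        return len(self.prompt_tokens) + len(self.generated)

    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens


class TokenBudgetScheduler:
    """Admits requests whenever the batch's total token count fits a KV-cache budget."""

    def __init__(self, token_budget: int):
        self.token_budget = token_budget   # e.g. derived from available HBM
        self.waiting = deque()
        self.running = []

    def add_request(self, seq: Sequence) -> None:
        self.waiting.append(seq)

    def step(self, decode_fn) -> list:
        # 1) Admit waiting sequences while their token footprint fits the budget.
        used = sum(s.num_tokens() for s in self.running)
        while self.waiting and used + self.waiting[0].num_tokens() <= self.token_budget:
            seq = self.waiting.popleft()
            used += seq.num_tokens()
            self.running.append(seq)
        # 2) One decode step per running sequence (batched on the accelerator in practice).
        for seq in self.running:
            seq.generated.append(decode_fn(seq))
        # 3) Retire finished sequences immediately so their KV-cache slots can be reused.
        done = [s for s in self.running if s.finished()]
        self.running = [s for s in self.running if not s.finished()]
        return done


# Toy usage with a dummy decode_fn that always emits token id 0.
sched = TokenBudgetScheduler(token_budget=4096)
sched.add_request(Sequence(prompt_tokens=[1, 2, 3], max_new_tokens=4))
while sched.waiting or sched.running:
    sched.step(decode_fn=lambda seq: 0)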

In addition, Hex-LLM supports a wide range of popular dense and sparse LLM models, including:

  • Gemma 2B and 7B

  • Gemma 2 9B and 27B

  • Llama 2 7B, 13B, and 70B

  • Llama 3 8B and 70B

  • Mistral 7B and Mixtral 8x7B

As the LLM field keeps evolving, we are also committed to bringing in more advanced technologies and the latest and greatest foundation models to Hex-LLM.

Hex-LLM delivers competitive performance with high throughput and low latency. We conducted benchmark experiments, and the metrics we measured are explained as follows:

  • TPS (tokens per second) is the average number of tokens an LLM server receives per second. Similar to QPS (queries per second), which is used to measure the traffic of a general server, TPS measures the traffic of an LLM server, but in a more fine-grained fashion.

  • Throughput measures how many tokens the server can generate over a given timespan at a specific TPS. This is a key metric for estimating the server's capacity to process concurrent requests.

  • Latency measures the average time to generate one output token at a specific TPS. This estimates the end-to-end time spent on the server side for each request, including all queueing and processing time.

Note that there is usually a tradeoff between high throughput and low latency. As TPS increases, both throughput and latency should increase. The throughput will saturate at a certain TPS, while the latency will continue to increase with higher TPS. Thus, given a particular TPS, we can measure a pair of throughput and latency metrics of the server. The resultant throughput-latency plot with respect to different TPS gives an accurate measurement of the LLM server performance.
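
To make the relationship between these metrics concrete, the sketch below computes throughput and per-output-token latency from a list of timed requests collected at one TPS setting. The function and its inputs are illustrative assumptions, not part of the Hex-LLM benchmark harness.

def summarize(results, window_seconds):
    """results: list of (num_output_tokens, end_to_end_latency_seconds) per request."""
    total_output_tokens = sum(n for n, _ in results)
    # Throughput: output tokens generated per second over the measurement window.
    throughput = total_output_tokens / window_seconds
    # Latency: average time per output token, including queueing time.
    per_token_latency = sum(t for _, t in results) / total_output_tokens
    return throughput, per_token_latency


# Example: 100 requests of 256 output tokens each, completed within a 60 s window
# with an average end-to-end time of 4.1 s -> roughly 427 tok/s and 16 ms/token.
print(summarize([(256, 4.1)] * 100, window_seconds=60))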

The data used to benchmark Hex-LLM is sampled from the ShareGPT dataset, a widely adopted dataset containing prompts and outputs of variable lengths.
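
For readers who want to reproduce a similar workload, the snippet below sketches one way to sample variable-length prompts from a public ShareGPT JSON dump. The file name and record fields are assumptions about that dataset's layout, not the exact preprocessing used for our benchmark.

import json
import random

# Assumed file name of a public ShareGPT dump; replace with your local copy.
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    conversations = json.load(f)

# Use the first human turn of each conversation as the prompt.
prompts = [
    conv["conversations"][0]["value"]
    for conv in conversations
    if conv.get("conversations") and conv["conversations"][0]["from"] == "human"
]
sampled = random.sample(prompts, k=min(1000, len(prompts)))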

In the following charts, we present the performance of Gemma 7B and Llama 2 70B (int8 weight quantized) models on eight TPU v5e chips:

  • Gemma 7B model: 6 ms per output token for the lowest TPS, 6,250 output tokens per second at the highest TPS;

  • Llama 2 70B int8 model: 26 ms per output token for the lowest TPS, 1,510 output tokens per second at the highest TPS.

Get started in Vertex AI Model Garden

We have integrated the Hex-LLM TPU serving container into Vertex AI Model Garden. Users can access this serving technology through the playground, one-click deployment, or Colab Enterprise Notebook examples for a variety of models.

Vertex AI Model Garden’s playground is a pre-deployed Vertex AI Prediction endpoint integrated into the UI. Users type in a prompt and, optionally, request arguments, click the SUBMIT button, and quickly get the model response. Try it out with Gemma!

To deploy a custom Vertex Prediction endpoint with Hex-LLM, one-click deployment through the model card UI is the easiest approach:

1. Navigate to the model card page and click on the “DEPLOY” button.

2. For the model variation of interest, select the TPU v5e machine type ct5lp-hightpu-*t for deployment. Click “DEPLOY” at the bottom to begin the deployment process. You will receive two email notifications: one when the model is uploaded and one when the endpoint is ready.

For maximum flexibility, users can use Colab Enterprise notebook examples to deploy a Vertex Prediction endpoint with Hex-LLM using the Vertex Python SDK.

1. Navigate to the model card page and click on the “OPEN NOTEBOOK” button.

2. Select the Vertex Serving notebook. This will open the notebook in Colab Enterprise.

3. Run through the notebook to deploy using Hex-LLM and send prediction requests to the endpoint.

The code snippet for the deployment function is as follows.

from typing import Tuple

from google.cloud import aiplatform

# PROJECT_ID, REGION, HEXLLM_DOCKER_URI, and HF_TOKEN are defined in earlier
# cells of the notebook.


def deploy_model_hexllm(
    model_name: str,
    model_id: str,
    service_account: str,
    machine_type: str = "ct5lp-hightpu-1t",
    tensor_parallel_size: int = 1,
    hbm_utilization_factor: float = 0.6,
    max_running_seqs: int = 256,
    endpoint_id: str = "",
    min_replica_count: int = 1,
    max_replica_count: int = 1,
) -> Tuple[aiplatform.Model, aiplatform.Endpoint]:
    """Deploys models with Hex-LLM on TPU in Vertex AI."""
    if endpoint_id:
        aip_endpoint_name = (
            f"projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint_id}"
        )
        endpoint = aiplatform.Endpoint(aip_endpoint_name)
    else:
        endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")

    # Arguments passed to the Hex-LLM server inside the serving container.
    hexllm_args = [
        "--host=0.0.0.0",
        "--port=7080",
        "--log_level=INFO",
        "--enable_jit",
        f"--model={model_id}",
        "--load_format=auto",
        f"--tensor_parallel_size={tensor_parallel_size}",
        f"--hbm_utilization_factor={hbm_utilization_factor}",
        f"--max_running_seqs={max_running_seqs}",
    ]
    hexllm_envs = {
        "PJRT_DEVICE": "TPU",
        "RAY_DEDUP_LOGS": "0",
        "RAY_USAGE_STATS_ENABLED": "0",
        "MODEL_ID": model_id,
        "DEPLOY_SOURCE": "notebook",
    }
    if HF_TOKEN:
        hexllm_envs.update({"HF_TOKEN": HF_TOKEN})

    model = aiplatform.Model.upload(
        display_name=model_name,
        serving_container_image_uri=HEXLLM_DOCKER_URI,
        serving_container_command=[
            "python", "-m", "hex_llm.server.api_server"
        ],
        serving_container_args=hexllm_args,
        serving_container_ports=[7080],
        serving_container_predict_route="/generate",
        serving_container_health_route="/ping",
        serving_container_environment_variables=hexllm_envs,
        serving_container_shared_memory_size_mb=(16 * 1024),  # 16 GB
        serving_container_deployment_timeout=7200,
    )

    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        deploy_request_timeout=1800,
        service_account=service_account,
        min_replica_count=min_replica_count,
        max_replica_count=max_replica_count,
    )
    return model, endpoint

Here, users can customize the deployment to best align with their needs. For instance, they can deploy with multiple replicas to handle a large amount of projected traffic.
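
As a hypothetical example, the invocation below deploys Gemma 7B with Hex-LLM on four TPU v5e chips and then queries the resulting endpoint. SERVICE_ACCOUNT and the request fields (prompt, max_tokens, temperature) are illustrative assumptions; PROJECT_ID, REGION, and the other constants are defined earlier in the notebook, which also documents the exact request schema the container expects.

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

model, endpoint = deploy_model_hexllm(
    model_name="gemma-7b-hexllm",
    model_id="google/gemma-7b",
    service_account=SERVICE_ACCOUNT,   # assumed to be set, like PROJECT_ID
    machine_type="ct5lp-hightpu-4t",   # TPU v5e machine type with 4 chips
    tensor_parallel_size=4,            # shard the model across the 4 chips
    max_replica_count=2,               # allow scale-out for projected traffic
)

# Send a prediction request; the instance fields below are an assumed schema.
response = endpoint.predict(
    instances=[{"prompt": "What is a TPU?", "max_tokens": 128, "temperature": 0.7}]
)
print(response.predictions)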

Take the next step

To learn more about how Vertex AI can help your organization, click here, and to learn more about how Google Cloud customers are innovating with generative AI, read How 7 businesses are putting Google Cloud’s AI innovations to work.
