
Benchmark and optimize LLMs on-device with AI Edge Portal

LLMs have become more powerful at smaller sizes, but deploying them to edge devices like smartphones remains a massive challenge. Today, developers have to optimize across a sprawling combination of accelerators, operating systems, and countless System-on-a-Chip (SoC) configurations, often relying on manual testing with just a handful of devices. Google AI Edge Portal helps solve these challenges. 

By letting developers test ML workloads across a fleet of over 120 representative Android device types, Google AI Edge Portal provides deep insight into latency and performance across all CPU, GPU, and NPU backends.

Today, we are excited to announce two new features that extend Google AI Edge Portal for the generative AI era: benchmarking and debugging on-device LLMs. These new services give developers what they need to optimize generative AI performance accurately and efficiently across the entire Android ecosystem.

Benchmark LLMs across over 120 different mobile devices

When a user interacts with an LLM-enabled experience in your app, they expect fast and consistent performance on their device. A long initialization time can make your app appear to freeze, and in the worst case the app can crash completely if the model consumes all available memory.

With the latest release of Google AI Edge Portal, you can now run automated gen AI benchmarks directly on a physical lab of over 120 diverse Android devices and test for these scenarios specifically. Portal natively supports CPU and GPU benchmarking for LLMs in the LiteRT-LM format.

Customers can benchmark GenAI models on over 120 Android devices, viewing metrics including initialization time, prefill speed, decode speed, and peak memory usage.

When you trigger a gen AI benchmarking job with Portal, it profiles the critical metrics that dictate your end-users’ experience when interacting with your AI application on-device:

| Metric | What it measures | Why it matters to you |
| --- | --- | --- |
| Initialization time | Measures how long it takes to load your model into memory. | High initialization time can result in delays, or freeze the user interface when your application starts up. |
| Prefill speed | Captures how fast the device processes prompt tokens to generate the first output token. | Dictates the initial delay before the user sees the first response. |
| Decode speed | Captures how fast the model generates tokens during a response. | Dictates the speed at which output is generated. |
| Peak memory | Monitors maximum RAM usage. | Flags potential “out of memory” crash risk, especially prevalent on memory-constrained devices. |
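The prefill and decode metrics translate into user-visible latency with simple arithmetic. The sketch below is a rough illustration rather than anything Portal provides: it assumes both speeds are expressed in tokens per second (the units are not stated above), and the function and parameter names are hypothetical.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps, init_s=0.0):
    """Rough latency estimate from benchmark-style metrics.

    Assumption: prefill_tps and decode_tps are in tokens per second.
    All names here are hypothetical and not part of any Portal API.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("speeds must be positive")
    # Delay before the first output token: the whole prompt must be processed.
    time_to_first_token_s = prompt_tokens / prefill_tps
    # Time spent streaming the response (approximates every output token
    # as being produced at decode speed).
    decode_s = output_tokens / decode_tps
    return {
        "time_to_first_token_s": time_to_first_token_s,
        "decode_s": decode_s,
        "total_s": init_s + time_to_first_token_s + decode_s,
    }


# Illustrative numbers only, not measured results.
estimate = estimate_latency(prompt_tokens=512, output_tokens=128,
                            prefill_tps=256.0, decode_tps=16.0, init_s=1.5)
print(estimate)
```

In this made-up case the response takes far longer to stream than the prompt takes to process, which suggests decode speed would be the first thing to optimize.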

With these insights, you can confidently decide which devices are ready to host your model and adjust or better optimize your LLMs for device targeting before shipping.
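One way to turn these metrics into a device-targeting decision is to compare each device's results against thresholds you choose. The following is a minimal sketch under stated assumptions: the `BenchmarkResult` record, its field names and units, the thresholds, and the device names are all hypothetical, and the numbers are made up for illustration. Portal's actual export format may differ.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    # Hypothetical record of one device's benchmark run.
    device: str
    init_s: float          # initialization time, seconds (assumed unit)
    prefill_tps: float     # prefill speed, tokens per second (assumed unit)
    decode_tps: float      # decode speed, tokens per second (assumed unit)
    peak_memory_mb: float  # peak memory, megabytes (assumed unit)


def ready_devices(results, max_init_s, min_decode_tps, max_peak_memory_mb):
    """Return the names of devices that meet every threshold."""
    return [
        r.device
        for r in results
        if r.init_s <= max_init_s
        and r.decode_tps >= min_decode_tps
        and r.peak_memory_mb <= max_peak_memory_mb
    ]


# Made-up sample data with placeholder device names.
sample = [
    BenchmarkResult("device-a", init_s=1.2, prefill_tps=300.0, decode_tps=18.0, peak_memory_mb=1400.0),
    BenchmarkResult("device-b", init_s=4.8, prefill_tps=90.0, decode_tps=6.0, peak_memory_mb=1500.0),
    BenchmarkResult("device-c", init_s=1.9, prefill_tps=210.0, decode_tps=12.0, peak_memory_mb=2600.0),
]

print(ready_devices(sample, max_init_s=3.0, min_decode_tps=10.0, max_peak_memory_mb=2000.0))
```

Devices that miss a threshold are candidates for a smaller or more aggressively quantized model variant rather than for exclusion outright.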

Debug performance easily with Model Explorer

Benchmarking is only useful if you can fix the performance issues it uncovers. When an LLM performs poorly, finding the root cause within a complex graph of many layers and thousands of nodes is daunting: the search is tedious and can take hours, if not days.

To bridge this gap, we have added the ability to visualize and compare model graphs in Portal. Through the natively integrated Model Explorer, our graph visualization tool, you can search for and locate specific nodes, compare models side-by-side in the same tab, view tensor shapes, trace inputs and outputs, and more. To further speed up debugging for teams, we also added the ability to take screenshots and share specific views directly with your collaborators in Google Cloud.

These visualizations are one of the most effective ways to identify targets for optimization, including:

  • Conversion: Model Explorer simplifies the identification of conversion anomalies through its dual-view comparison tool. This interface allows you to traverse complex model architectures by selectively expanding or collapsing specific layers, granting you the ability to analyze internal dependencies and structural nodes with precise granularity.
  • Quantization: Model Explorer aids in detecting specific operations where quantization may compromise performance. By sorting layers using error metrics, you can pinpoint precision loss, access granular per-layer data, and evaluate various quantization strategies to achieve an optimal balance between model footprint and output quality.
  • Optimization: Use Model Explorer to visualize hardware compatibility, organize operations by latency, and conduct granular, per-op performance comparisons across different hardware accelerators.
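To make the idea of sorting layers by an error metric concrete, here is a small standalone sketch. It is not Model Explorer's API or its implementation: it assumes you already have, for each layer, the output values from the float model and from the quantized model as plain lists, and it uses mean squared error as one possible metric. The layer names and values are made up.

```python
def mean_squared_error(reference, quantized):
    """Mean squared error between two equal-length lists of floats."""
    if len(reference) != len(quantized) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - q) ** 2 for r, q in zip(reference, quantized)) / len(reference)


def rank_layers_by_error(float_outputs, quant_outputs):
    """Return (layer_name, error) pairs, largest error first.

    Both arguments map a layer name to that layer's output values.
    """
    errors = {
        name: mean_squared_error(float_outputs[name], quant_outputs[name])
        for name in float_outputs
    }
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)


# Made-up per-layer outputs for illustration.
float_outputs = {"attention": [1.0, 2.0], "mlp": [1.0, 2.0]}
quant_outputs = {"attention": [1.0, 2.0], "mlp": [0.0, 4.0]}

for name, error in rank_layers_by_error(float_outputs, quant_outputs):
    print(name, error)
```

Layers at the top of the ranking are the natural first candidates for a higher-precision quantization scheme.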

With Model Explorer, you can view model graphs, search for specific layers, and compare models side-by-side to debug performance.

Start benchmarking LLMs on-device today

With the era of on-device LLMs here, we are excited to help close the critical benchmarking gap and bring the power of AI to the thousands of smartphone models on the market today. To use these latest features, please complete our sign-up form here to express interest.

Google AI Edge Portal is currently available in private preview for allowlisted Google Cloud customers. During this private preview period, access is provided at no charge, subject to the preview terms. All current allowlisted customers will receive access to these new features automatically. 

We can’t wait to see what gen AI capabilities you are able to deploy across the full spectrum of devices with Google AI Edge Portal!


Thank you to the team members and collaborators whose contributions made the advancements in this release possible: Akshat Sharma, Ami Kubota, Charlie Xu, Chunlei Niu, Cormac Brick, Derek Bekebrede, Eric Yang, Jing Jin, Kathleen Low, Matthias Grundmann, Marissa Ikonomidis, Na Li, Ram Iyengar, Sachin Kotwani, Sommayah Soliman, Tenghui Zhu, Xiaoming Hu, Zi Yuan
