
Benchmark and optimize LLMs on-device with AI Edge Portal


LLMs have become more powerful at smaller sizes, but deploying them to edge devices like smartphones remains a massive challenge. Today, developers have to optimize across a sprawling combination of accelerators, operating systems, and countless System-on-a-Chip (SoC) configurations, often relying on manual testing with just a handful of devices. Google AI Edge Portal helps solve these challenges.

By letting developers test ML workloads across a fleet of over 120 representative Android device types, Google AI Edge Portal provides deep insight into latency and performance across all CPU, GPU, and NPU backends.

Today, we are excited to announce two new features that extend Google AI Edge Portal for the generative AI era: benchmarking and debugging on-device LLMs. These new services give developers what they need to optimize generative AI performance accurately and efficiently across the entire Android ecosystem.

Benchmark LLMs across over 120 different mobile devices

When a user interacts with an LLM-enabled experience in your app, they expect fast and consistent performance on their device. A long initialization time can make your app appear to freeze, and in the worst case the app can crash outright if the model consumes all available memory.

With the latest release of Google AI Edge Portal, you can now run automated gen AI benchmarks directly on a physical lab of over 120 diverse Android devices and test for these scenarios specifically. Portal natively supports CPU and GPU benchmarking for LLMs in the LiteRT-LM format.

Customers can benchmark gen AI models on over 120 Android devices, viewing metrics including initialization time, prefill speed, decode speed, and peak memory usage.

When you trigger a gen AI benchmarking job with Portal, it profiles the critical metrics that dictate your end-users’ experience when interacting with your AI application on-device:

Metric	What it measures	Why it matters to you
Initialization time	Measures how long it takes to load your model into memory.	A long initialization time can delay startup or freeze the user interface when your application launches.
Prefill speed	Captures how fast the device processes prompt tokens to generate the first output token.	Dictates the initial delay before the user sees the first response.
Decode speed	Captures how fast the model generates tokens during a response.	Dictates the speed at which output is generated.
Peak memory	Monitors maximum RAM usage.	Flags potential “out of memory” crash risk, which is especially prevalent on memory-constrained devices.
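
To make the link between these metrics and perceived latency concrete, here is a minimal Python sketch. It assumes prefill and decode speeds are expressed in tokens per second; that unit, the `BenchmarkResult` container, the function names, and every number below are illustrative assumptions and do not reflect Portal's actual output format.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Hypothetical container for one device's benchmark numbers."""
    init_time_s: float           # time to load the model into memory, in seconds
    prefill_tokens_per_s: float  # prompt tokens processed per second (assumed unit)
    decode_tokens_per_s: float   # output tokens generated per second (assumed unit)
    peak_memory_mb: float        # maximum RAM usage, in megabytes


def time_to_first_token_s(result, prompt_tokens):
    """Approximate delay before the first output token appears."""
    return prompt_tokens / result.prefill_tokens_per_s


def response_time_s(result, prompt_tokens, output_tokens):
    """Approximate time to process the prompt and generate the full response."""
    prefill_s = time_to_first_token_s(result, prompt_tokens)
    decode_s = output_tokens / result.decode_tokens_per_s
    return prefill_s + decode_s


# Illustrative numbers only, not measurements from any real device.
example = BenchmarkResult(
    init_time_s=2.0,
    prefill_tokens_per_s=200.0,
    decode_tokens_per_s=10.0,
    peak_memory_mb=1500.0,
)
ttft = time_to_first_token_s(example, prompt_tokens=100)
total = response_time_s(example, prompt_tokens=100, output_tokens=50)
print(f"time to first token: {ttft:.2f} s, full response: {total:.2f} s")
```

Under these assumed numbers, a 100-token prompt takes about half a second to prefill, while generating 50 tokens takes about five seconds, so decode speed dominates the wait for longer responses.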

With these insights, you can decide with confidence which devices are ready to host your model, and tune your LLMs for the devices you target before shipping.
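
One way to turn benchmark results into a device-readiness decision is to check each device against a performance budget. The Python sketch below is only an illustration: the dictionary layout, the field names, the device names, the thresholds, and the sample values are all hypothetical, and it does not use any Portal API or export format.

```python
def devices_within_budget(results, max_init_s, min_decode_tps, max_peak_mb):
    """Return the sorted names of devices whose metrics all fit the budget."""
    ready = []
    for device, metrics in results.items():
        if (metrics["init_time_s"] <= max_init_s
                and metrics["decode_tokens_per_s"] >= min_decode_tps
                and metrics["peak_memory_mb"] <= max_peak_mb):
            ready.append(device)
    return sorted(ready)


# Hypothetical per-device results; the values are made up for illustration.
sample_results = {
    "device_a": {"init_time_s": 1.8, "decode_tokens_per_s": 14.0, "peak_memory_mb": 1400.0},
    "device_b": {"init_time_s": 6.5, "decode_tokens_per_s": 12.0, "peak_memory_mb": 1350.0},
    "device_c": {"init_time_s": 2.2, "decode_tokens_per_s": 4.0, "peak_memory_mb": 1200.0},
    "device_d": {"init_time_s": 2.9, "decode_tokens_per_s": 9.0, "peak_memory_mb": 2600.0},
}

ready_devices = devices_within_budget(
    sample_results,
    max_init_s=3.0,       # longest acceptable model load time
    min_decode_tps=8.0,   # slowest acceptable generation speed
    max_peak_mb=2000.0,   # memory ceiling to stay clear of out-of-memory crashes
)
print(ready_devices)
```

In this made-up data set only `device_a` meets all three limits; the other devices each miss one (load time, decode speed, and peak memory respectively), which suggests where optimization effort would be needed for each.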

Debug performance easily with Model Explorer

Benchmarking is only useful if you can fix the discovered performance issues. When an LLM performs poorly, finding the root cause within the complex graph of multiple layers and thousands of nodes is a daunting task for developers, involving tedious and time-consuming searching that can take hours if not days.

To bridge this gap, we have added the ability to visualize and compare model graphs in Portal. Through the natively integrated Model Explorer, our graph visualization tool, you can search for and locate specific nodes, compare models side by side in the same tab, view tensor shapes, trace inputs and outputs, and more. To further speed up debugging for teams, we also added the ability to take screenshots and share specific views directly with your collaborators in Google Cloud.

These visualizations are one of the most effective ways to identify targets for optimization, including:

Conversion: Model Explorer simplifies the identification of conversion anomalies through its dual-view comparison tool. This interface allows you to traverse complex model architectures by selectively expanding or collapsing specific layers, granting you the ability to analyze internal dependencies and structural nodes with precise granularity.
Quantization: Model Explorer aids in detecting specific operations where quantization may compromise performance. By sorting layers using error metrics, you can pinpoint precision loss, access granular per-layer data, and evaluate various quantization strategies to achieve an optimal balance between model footprint and output quality.
Optimization: Use Model Explorer to visualize hardware compatibility, organize operations by latency, and conduct granular, per-op performance comparisons across different hardware accelerators.
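
The idea behind sorting layers by an error metric can be sketched outside the tool in a few lines of Python. This is not Model Explorer's API or its implementation: the layer names, the sample outputs, and the choice of mean squared error as the metric are assumptions made purely for illustration.

```python
def mean_squared_error(reference, candidate):
    """Mean squared difference between two equal-length lists of values."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)


def rank_layers_by_error(float_outputs, quantized_outputs):
    """Return (layer, error) pairs, largest quantization error first."""
    errors = {
        layer: mean_squared_error(float_outputs[layer], quantized_outputs[layer])
        for layer in float_outputs
    }
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-layer outputs from a float model and its quantized version.
float_outputs = {
    "attention": [1.0, 2.0],
    "mlp": [0.5, 0.5],
    "norm": [1.0, 1.0],
}
quantized_outputs = {
    "attention": [1.0, 2.0],
    "mlp": [1.5, 0.5],
    "norm": [1.0, 1.5],
}

ranking = rank_layers_by_error(float_outputs, quantized_outputs)
for layer, error in ranking:
    print(f"{layer}: {error:.3f}")
```

Layers at the top of such a ranking are natural candidates for a different quantization strategy, for example keeping them at higher precision while the rest of the model stays quantized.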

With Model Explorer, you can view model graphs, search for specific layers, and compare models side-by-side to debug performance.

Start benchmarking LLMs on-device today

With the era of on-device LLMs here, we are excited to help close the critical gap in benchmarking and bring the power of AI to the thousands of types of smartphones on the market today. To use these latest features, please complete our sign-up form to express interest.

Google AI Edge Portal is currently available in private preview for allowlisted Google Cloud customers. During this private preview period, access is provided at no charge, subject to the preview terms. All current allowlisted customers will receive access to these new features automatically.

We can’t wait to see what gen AI capabilities you are able to deploy across the full spectrum of devices with Google AI Edge Portal!

Thank you to the members of the team and collaborators for their contributions in making the advancements in this release possible: Akshat Sharma, Ami Kubota, Charlie Xu, Chunlei Niu, Cormac Brick, Derek Bekebrede, Eric Yang, Jing Jin, Kathleen Low, Matthias Grundmann, Marissa Ikonomidis, Na Li, Ram Iyengar, Sachin Kotwani, Sommayah Soliman, Tenghui Zhu, Xiaoming Hu, Zi Yuan
