Categories: FAANG

Shining Brighter Together: Google’s Gemma Optimized to Run on NVIDIA GPUs

NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create the Gemini models — with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud, and locally on workstations with NVIDIA RTX GPUs or PCs with GeForce RTX GPUs.

This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.

Developers can also run Gemma on NVIDIA GPUs in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU and soon, NVIDIA’s H200 Tensor Core GPUs — featuring 141GB of HBM3e memory at 4.8 terabytes per second — which Google will deploy this year.

Enterprise developers can additionally take advantage of NVIDIA’s rich ecosystem of tools — including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM — to fine-tune Gemma and deploy the optimized model in their production applications.

Learn more about how TensorRT-LLM is revving up inference for Gemma, along with additional information for developers. This includes several model checkpoints of Gemma and the FP8-quantized version of the model, all optimized with TensorRT-LLM.

Experience Gemma 2B and Gemma 7B directly from your browser on the NVIDIA AI Playground.

Gemma Coming to Chat With RTX

Adding support for Gemma soon is Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs.

The Chat with RTX lets users personalize a chatbot with their own data by easily connecting local files on an RTX PC to a large language model.

Since the model runs locally, it provides results fast, and user data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.

AI Generated Robotic Content

Recent Posts

Accelerating LLM Inference on NVIDIA GPUs with ReDrafter

Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally…

7 hours ago

How Fastweb fine-tuned the Mistral model using Amazon SageMaker HyperPod as a first step to build an Italian large language model

This post is co-written with Marta Cavalleri and Giovanni Germani from Fastweb, and Claudia Sacco…

7 hours ago

Optimizing RAG retrieval: Test, tune, succeed

Retrieval-augmented generation (RAG) supercharges large language models (LLMs) by connecting them to real-time, proprietary, and…

7 hours ago

NASA Postpones Return of Stranded Starliner Astronauts to March

Barry Wilmore and Suni Williams will now come home in March at the earliest, to…

8 hours ago

Swarms of ‘ant-like’ robots lift heavy objects and hurl themselves over obstacles

Scientists have developed swarms of tiny magnetic robots that work together like ants to achieve…

8 hours ago

Human-like artificial intelligence may face greater blame for moral violations

In a new study, participants tended to assign greater blame to artificial intelligences (AIs) involved…

8 hours ago