Categories: FAANG

By Jove, It’s No Myth: NVIDIA Triton Speeds Inference on Oracle Cloud

An avid cyclist, Thomas Park knows the value of having lots of gears to maintain a smooth, fast ride.

So, when the software architect designed an AI inference platform to serve predictions for Oracle Cloud Infrastructure’s (OCI) Vision AI service, he picked NVIDIA Triton Inference Server. That’s because it can shift up, down or sideways to handle virtually any AI model, framework and hardware and operating mode — quickly and efficiently.

“The NVIDIA AI inference platform gives our worldwide cloud services customers tremendous flexibility in how they build and run their AI applications,” said Park, a Zurich-based computer engineer and competitive cycler who’s worked for four of the world’s largest cloud services providers.

Specifically, Triton reduced OCI’s total cost of ownership by 10%, increased prediction throughput up to 76% and reduced inference latency up to 51% for OCI Vision and Document Understanding Service models that were migrated to Triton. The services run globally across more than 45 regional data centers, according to an Oracle blog Park and a colleague posted earlier this year.

Computer Vision Accelerates Insights

Customers rely on OCI Vision AI for a wide variety of object detection and image classification jobs. For instance, a U.S.-based transit agency uses it to automatically detect the number of vehicle axles passing by to calculate and bill bridge tolls, sparing busy truckers wait time at toll booths.

OCI AI is also available in Oracle NetSuite, a set of business applications used by more than 37,000 organizations worldwide. It’s used, for example, to automate invoice recognition.

Thanks to Park’s work, Triton is now being adopted across other OCI services, too.

A Triton-Aware Data Service

“We’ve built a Triton-aware AI platform for our customers,” said Tzvi Keisar, a director of product management for OCI’s Data Science service, which handles machine learning for Oracle’s internal and external users.

“If customers want to use Triton, we’ll save them time by automatically doing the configuration work for them in the background, launching a Triton-powered inference endpoint for them,” said Keisar.

His team also plans to make it even easier for its other users to embrace the fast, flexible inference server. Triton is included in NVIDIA AI Enterprise, a platform that provides full security and support businesses need — and it’s available on OCI Marketplace.

A Massive SaaS Platform

OCI’s Data Science service is the machine learning platform for both NetSuite and Oracle Fusion software-as-a-service applications.

“These platforms are massive, with tens of thousands of customers who are also building their work on top of our service,” he said.

It’s a wide swath of mainly enterprise users in manufacturing, retail, transportation and other industries. They’re building and using AI models of nearly every shape and size.

Inference was one of the group’s first services, and Triton came on the team’s radar not long after its launch.

A Best-in-Class Inference Framework

“We saw Triton pick up in popularity as a best-in-class serving framework, so we started experimenting with it,” Keisar said. “We saw really good performance, and it closed a gap in our existing offerings, especially on multi-model inference — it’s the most versatile and advanced inferencing framework out there.”

Launched on OCI in March, Triton has already attracted the attention of many internal teams at Oracle hoping to use it for inference jobs that require serving predictions from multiple AI models running concurrently.

“Triton has a very good track record and performance on multiple models deployed on a single endpoint,” he said.

Accelerating the Future

Looking ahead, Keisar’s team is evaluating NVIDIA TensorRT-LLM software to supercharge inference on the complex large language models (LLMs) that have captured the imagination of many users.

An active blogger, Keisar’s latest article detailed creative quantization techniques for running a Llama 2 LLM with a whopping 70 billion parameters on NVIDIA A10 Tensor Core GPUs.

“Even down to four bits, the quality of model outputs is still quite good,” he said. “I can’t explain all the math, but we found a good balance, and I haven’t seen anyone else do this yet.”

After announcements this fall that Oracle is deploying the latest NVIDIA H100 Tensor Core GPUs, H200 GPUs, L40S GPUs and Grace Hopper Superchips, it’s just the start of many accelerated efforts to come.

AI Generated Robotic Content

Recent Posts

Does anyone else can’t stand ComfyUI and prefers classic Automatic/Forge UI or it’s just me?

EDIT: I can't believe how many great and useful replies I've got, and not a…

7 hours ago

Serving Multiple Users at Once: How Continuous Batching Keeps LLM Inference Efficient

This article is divided into four parts; they are: • The Problem with Static Batching…

7 hours ago

Everyone Has Their Targets Set on the MacBook Neo

Dell, Microsoft, and others are unveiling new laptops to compete directly with the Neo, but…

8 hours ago

Photon-driven synapse advances low-power neuromorphic systems

Modern artificial intelligence systems rely on moving large amounts of data between memory and processors,…

8 hours ago

Anima – Sharing Some Prompts and Results

Been experimenting with Anima lately and ended up spending way too much time refining prompts.…

1 day ago

Keychron K2 HE Concrete Edition Review: Rock-Solid Typing

Keychron's K2 HE Concrete Edition sounds like a cute gimmick, but as I discovered, there's…

1 day ago