10l0prhKlcOjf d3 JVQTJg

Dynamically Splitting Wide Partitions in Cassandra for Time Series Workloads

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan Introduction Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal event data with millisecond latency. We use Apache Cassandra 4.x as the underlying storage for these main reasons: Throughput, latency, and cost: Cassandra can handle millions of low‑latency reads and writes …

ML 20384 1

The art and science of hyperparameter optimization on Amazon Nova Forge

Large language models (LLMs) deliver strong results on general tasks, but they often struggle with specialized work that requires understanding proprietary data, internal processes, or domain-specific terminology. Amazon Nova Forge addresses this by enabling you to build your own frontier models using Amazon Nova. You can start development from early model checkpoints, blend proprietary data …

ML 21022 1

Reference your own AWS Secrets Manager secrets in Amazon Bedrock AgentCore Identity

AI agents are only as powerful as the tools they can access. Whether retrieving customer data from a CRM, posting updates to Slack, or querying a GitHub repository, agents need to call external APIs, and that means securely passing credentials at runtime. Getting that right, without hardcoding secrets in code or exposing them in agent …

2 Architecturemax 1000x1000 1

How Trustpilot built a real-time architecture for data enrichment using Gemma

Processing millions of user reviews in real-time, under strict latency and cost constraints, is no easy task. Trustpilot has been doing exactly that with custom machine learning since long before large language models (LLMs) were cool. Now, as the company transitions its core stack to generative AI, here is a look at how we teamed …

174bhmTRm27PKE4K2q RMXQ

Enterprise Business Software and the Mixed-Up Chameleon Problem

Editor’s Note: This blog post was written by Greg Little, Senior Counselor at Palantir, with Aaron Jaffe, Senior Vice President at Palantir. Over 10 years of implementing Enterprise Resource Planning (ERP) systems, I remember one project where the CFO stopped the room cold. It was 11:30 at night during a mock cutover. People were exhausted …

16XXQ786vx4AwImVynUAU6A

High-Throughput Graph Abstraction at Netflix: Part I

By Oleksii Tkachuk, Kartik Sathyanarayanan, Rajiv Shringi Introduction Netflix has a diverse range of graph use cases, each serving specific business needs with unique functionality and performance requirements. These use cases fall into two broad categories: OLAP: These use cases typically involve open-ended and algorithmic exploration of large graph datasets. They often utilize industry-standard models and …

ML 21002 1 1

Comprehensive observability for Amazon SageMaker AI LLM inference: From GPU utilization to LLM quality

Deploying large language models (LLMs) at scale on Amazon SageMaker AI Inference makes observability a critical pillar of any production machine learning (ML) strategy. Unlike conventional software that returns deterministic outputs, LLMs generate variable, free-form responses that are difficult to validate with standard metrics. LLM output quality can change over time as input distributions shift, …

UsmanCLUMmax 1000x1000 1

Cloud CISO Perspectives: How to build an AI-ready security program for the public sector

Welcome to the second Cloud CISO Perspectives for May 2026. Today, Usman Chaudhary, Field CISO, Google Public Sector, offers a guide for CISOs protecting government agencies and critical infrastructure on how to get started — and get the most out of — defending with AI. As with all Cloud CISO Perspectives, the contents of this …

ML 20305 image 1

Training Azerbaijani language models on Amazon SageMaker AI

This solution builds on open source tools including PyTorch, Hugging Face Transformers, and Liger Kernels. The authors would also like to thank Aiham Taleb, Arefeh Ghahvechi, Manav Choudhary, Rohit Thekkanal, Daz Akbarov, Jamila Jamilova, Ross Povelikin, Almas Moldakanov, Christelle Xu, and Ivan Khvostishkov for their contributions in making this project possible. Azercell Telecom LLC, Azerbaijan’s …

image1 3Jp6s6Jmax 1000x1000 1

AI in SRE: Where and how Google is deploying agentic AI to improve operations

Since its inception over 20 years ago, Google has used Site Reliability Engineering (SRE) to keep services like Search, Gmail, Maps, YouTube and Google Cloud reliable and highly available, adhering to the principles and practices of the reliability-first mindset. Recently though, the emergence of AI has driven multiple step-changes in system complexity. Interactions between components …