SpecMD: A Comprehensive Study on Speculative Expert Prefetching

Mixture-of-Experts (MoE) models enable sparse expert activation, meaning that only a subset of the model’s parameters is used during each inference. However, to translate this sparsity into practical performance, an expert caching mechanism is required. Previous works have proposed hardware-centric caching policies, but how these various caching policies interact with each other and different hardware …

Tomofun Archtecture 1 1024x949 1

Cost effective deployment of vision-language models for pet behavior detection on AWS Inferentia2

Tomofun, the Taiwan-headquartered pet-tech startup behind the Furbo Pet Camera, is redefining how pet owners interact with their pets remotely. Furbo combines smart cameras with AI to detect behaviors such as barking, running, or unusual activity, and alerts owners in real time. At the core of this capability are computer vision and vision-language models that …

2 System diagrammax 1000x1000 1

Pioneering AI-assisted code migration: How Google achieved 6x faster migration from TensorFlow to JAX

AI coding agents are rapidly becoming ubiquitous across the software industry, fundamentally changing how developers write, test, and debug daily code. While these tools excel at localized, self-contained tasks, applying them to massive, systemic codebase migrations requires an entirely new approach. Google is already addressing this challenge by incorporating AI into many migration workflows: x86 …

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal …

ML 19061 image 1

How Hapag-Lloyd uses Amazon Bedrock to transform customer feedback into actionable insights

Hapag-Lloyd stands as one of the world’s leading liner shipping companies, operating a modern fleet of 313 container ships with a total transport capacity of 2.5 million TEU (Twenty-foot Equivalent Unit—a standard unit of measurement for cargo capacity in container shipping). The company maintains a container capacity of 3.7 million TEU, which includes one of …

Five must-have guides to move agents into production with Gemini Enterprise Agent Platform

Building AI agents that work well in a demo is one thing, but running them in production requires serious infrastructure.  At Google Cloud Next ’26, we introduced Gemini Enterprise Agent Platform to help developers build, deploy, scale, govern, and optimize  autonomous AI agents. From managing long-running state and enforcing security with the Agent Governance Stack, …

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents using outcome-only rewards suffers from credit-assignment ambiguity, obscuring which intermediate steps (or tool-use decisions) lead to success or failure. In this paper, we propose PORTool, an importance-aware policy-optimization algorithm that reinforces agents’ …

1RQuXyzwooTZcUZ5rCejOug

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Saish Sali, Nipun Kumar, Sura Elamurugu Introduction As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the …

image ml 208081 new

Beyond BI: How the Dataset Q&A feature of Amazon Quick powers the next generation of data decisions

Business leaders across industries rely on operational dashboards as the shared source of truth that their teams execute against daily. But dashboards are built to answer known questions. When teams need to explore further, ad-hoc, multi-dimensional, or unforeseen questions, they hit a bottleneck. They wait hours or days for BI teams to build new views …

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026. Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, …