Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all …

1 MoqLhu01wArcD1n58SZyg

Smarter Live Streaming at Scale: Rolling Out VBR for All Netflix Live Events

By Renata Teixeira, Zhi Li, Reenal Mahajan, and Wei Wei On January 26, 2026, we flipped an important switch for Live at Netflix: all Live events are now encoded using VBR (Variable Bitrate) instead of CBR (Constant Bitrate). It sounds like a small configuration change, but it required us to revisit some of the foundational assumptions …

ml 20566 image 1

Simulate realistic users to evaluate multi-turn AI agents in Strands Evals

Evaluating single-turn agent interactions follows a pattern that most teams understand well. You provide an input, collect the output, and judge the result. Frameworks like Strands Evaluation SDK make this process systematic through evaluators that assess helpfulness, faithfulness, and tool usage. In a previous blog post, we covered how to build comprehensive evaluation suites for …

How Honeylove boosts product quality and service efficiency with BigQuery

Building the perfect bra takes thousands of data points. That’s why Honeylove isn’t just another intimates brand. We’re a technology company that happens to make exceptional bras, tops, shapewear, and bodysuits. Technology shapes everything we do, from how we iterate garments based on customer feedback to how we optimize sizing across those thousands of data …

nishant

Automating competitive price intelligence with Amazon Nova Act

Monitoring competitor prices is essential for ecommerce teams to maintain a market edge. However, many teams remain trapped in manual tracking, wasting hours daily checking individual websites. This inefficient approach delays decision-making, raises operational costs, and risks human errors that result in missed revenue and lost opportunities. Amazon Nova Act is an open-source browser automation …

1 1B5SFVymax 1000x1000 1

Run real-time and async inference on the same infrastructure with GKE Inference Gateway

As AI workloads transition from experimental prototypes to production-grade services, the infrastructure supporting them faces a growing utilization gap. Enterprises today typically face a binary choice: build for high-concurrency, low-latency real-time requests, or optimize for high-throughput, “async” processing. In Kubernetes environments, these requirements are traditionally handled by separate, siloed GPU and TPU accelerator clusters. Real-time …

ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts

We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and …

ML20076 image 1

Build reliable AI agents with Amazon Bedrock AgentCore Evaluations

Your AI agent worked in the demo, impressed stakeholders, handled test scenarios, and seemed ready for production. Then you deployed it, and the picture changed. Real users experienced wrong tool calls, inconsistent responses, and failure modes nobody anticipated during testing. The result is a gap between expected agent behavior and actual user experience in production. …

Entropy-Preserving Reinforcement Learning

Policy gradient algorithms have driven many recent advancements in language model reasoning. An appealing property is their ability to learn from exploration on their own trajectories, a process crucial for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce the entropy—and thus the diversity of explored trajectories—as …

ML 19682 image 1

How Ring scales global customer support with Amazon Bedrock Knowledge Bases

This post is cowritten with David Kim, and Premjit Singh from Ring. Scaling self-service support globally presents challenges beyond translation. In this post, we show you how Ring, Amazon’s home security subsidiary, built a production-ready, multi-locale Retrieval-Augmented Generation (RAG)-based support chatbot using Amazon Bedrock Knowledge Bases. By eliminating per-Region infrastructure deployments, Ring reduced the cost …