Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph
Saish Sali, Nipun Kumar, Sura Elamurugu
As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our ML teams were relatively small, and optimizing member engagement was our primary use case. Fast forward to today, and machine learning has become the backbone of Netflix’s business transformation. We now apply ML across various business domains, including:
… and a growing number of additional use cases across the company
Each domain operates with a different tech stack, different business metrics, and a distinct organizational structure. While this diversity is a testament to how machine learning has evolved to drive value across many verticals at Netflix, this growth introduces a new challenge: enabling cross-pollination of models and data across domains.
As our ML investments scaled across these domains, a critical problem emerged: the models we produced largely became black boxes. Without any discovery infrastructure, ML practitioners couldn’t easily collaborate or share work across business verticals.
Consider a concrete example: content embeddings. Our Studio teams create sophisticated embeddings that identify scene boundaries, detect visual transitions, and understand content structure. These embeddings were originally built for production workflows.
But those same embeddings could be incredibly valuable elsewhere. Ads could hypothetically use content embeddings for context matching (ensuring advertisements align with the tone and content of what’s currently playing). Personalization could leverage them for episodic merchandising and recommendations (matching the topic or mood of an episode with a member’s viewing preferences). Yet making this cross-pollination happen is extraordinarily difficult.
Why? Our ML tools exist in silos, each with its own backend services and user interface. The model registry is unaware of which A/B tests are using its models, and the pipeline orchestrator is unaware of downstream model dependencies. ML practitioners have to traverse multiple systems to answer basic questions about their work. Finding a model requires opening the model registry, understanding its lineage means switching to the pipeline orchestrator, and tracking which A/B tests use that model requires navigating to the experimentation platform. This fragmentation prevents practitioners from answering critical questions:
The real challenge wasn’t just building a consolidated UI. We needed to connect the different pieces of infrastructure our ML practitioners were using to perform different parts of the ML lifecycle.
Our ML ecosystem generates metadata from dozens of sources:
Each system employs different formats, identifiers, and mental models. The hard technical problem we had to solve was: How do we collect this heterogeneous metadata, transform it into a unified entity model, and build a connected graph that enables true exploration and collaboration across business domains?
Our answer was the Metadata Service (MDS), which builds a Model Lifecycle Graph that indexes and connects ML-related entities across Netflix. MDS is optimized for real-time ingestion of ML metadata (e.g., models, features, pipelines, experiments, datasets) and to answer cross-domain questions such as “Which experiments are running this model?” or “Which models share these features?” It is the foundation that enables discovery, ingesting events from diverse sources, enriching them with context, and materializing relationships across entities.
Our vision: to make every ML asset at Netflix discoverable, understandable, and reusable by every ML practitioner, regardless of their team or domain.
Before diving into the technical implementation, it’s helpful to understand the conceptual model that underpins MDS. This vocabulary enables consistent communication across teams and systems:
Component: Any object that is uniquely addressable using an AI Platform (AIP) Uniform Resource Identifier (URI). An AIP URI follows the format aip://<componentType>/<platformId>/<resourceId>, ensuring global uniqueness. For example:
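aip://model/registry/ranking-model-v5-20XX0101 identifies a model instance in the model registry, while aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101 identifies a run in the pipeline orchestrator (both reappear in the walkthrough below).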
Entity: A component within the ML ecosystem, characterized by additional properties such as name, description, creation date, and owners. Entities represent ML-specific assets, such as models, features, and pipelines.
Entity Type: A group of entities that share the same data shape. A data shape is a set of property constraints that specify the attributes and relationships an entity must have.
Domain: A functional grouping of related entity types that defines the abstract interface for a category of ML assets. For example, the Models domain defines what a Model and Model Instance look like, while the Pipelines domain defines Schedules, Requests, and Executions.
Provider: A concrete implementation of a domain, backed by a specific source system. For example, the Models domain is currently backed by our internal model registry. This separation allows MDS to support multiple providers for the same domain. If a new model registry were introduced, it could be added as an additional provider without changing the domain interface.
We can summarize these concepts with a concrete example:
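To make this concrete with the model used throughout this post: aip://model/registry/ranking-model-v5-20XX0101 is a Component because it is uniquely addressable by an AIP URI. It is also an Entity, carrying a name, owners, and tags. Its Entity Type is ModelInstance, which belongs to the Models domain, and the internal model registry is the Provider that backs that domain.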
This URI-based addressing scheme is crucial as it allows any service to reference any ML asset with a single string, and MDS can resolve that reference back to rich, connected metadata.
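As a minimal illustration of that resolution step, here is how a client might split an AIP URI into its three parts using Python’s standard library (the function name and error handling are ours, not part of MDS):
from urllib.parse import urlparse

def parse_aip_uri(uri: str) -> dict:
    # Splits aip://<componentType>/<platformId>/<resourceId> into its parts.
    parsed = urlparse(uri)
    if parsed.scheme != "aip":
        raise ValueError(f"not an AIP URI: {uri}")
    # The netloc carries the component type; the path holds platform and resource IDs.
    platform_id, _, resource_id = parsed.path.lstrip("/").partition("/")
    return {
        "component_type": parsed.netloc,
        "platform_id": platform_id,
        "resource_id": resource_id,
    }

parse_aip_uri("aip://model/registry/ranking-model-v5-20XX0101")
# => {'component_type': 'model', 'platform_id': 'registry', 'resource_id': 'ranking-model-v5-20XX0101'}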
The journey from raw system events to a queryable graph happens in stages. Let’s walk through each with a concrete example: connecting a model to its A/B tests through relationship inference.
MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real-time. Source systems emit thin events that include an identifier and an event type.
Example event:
{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101",
  ...
}
This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements.
Each source system has dedicated event handlers in MDS:
MDS implements a hydration contract for each event type. When an event arrives, MDS:
This design has a crucial property: the order of events doesn’t matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.
This notification of change pattern has a few important tradeoffs. On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don’t overload them.
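A minimal sketch of this pattern in Python, under stated assumptions: the registry URL is a hypothetical placeholder, and normalize and upsert are stubs (normalize is sketched in the normalization section below).
import time

import requests  # assumed HTTP client

REGISTRY_URL = "https://model-registry.example/api/v1"  # hypothetical endpoint

def normalize(descriptor: dict) -> dict: ...  # sketched in the normalization section below
def upsert(entity: dict) -> None: ...         # idempotent write keyed by AIP URI (placeholder)

def fetch_with_backoff(url: str, retries: int = 3) -> dict:
    # Hydrate from the source of truth, backing off on rate limits
    # so hydration does not overload the source system.
    for attempt in range(retries):
        resp = requests.get(url, timeout=5)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)
    raise RuntimeError(f"rate-limited after {retries} attempts: {url}")

def handle_event(event: dict) -> None:
    # The event is only a notification that something changed; we always
    # re-fetch the latest descriptor, so out-of-order or dropped messages
    # are corrected by the next event for the same entity.
    if event["event_type"] == "model_instance_created":
        descriptor = fetch_with_backoff(f"{REGISTRY_URL}/instances/{event['instance_id']}")
        upsert(normalize(descriptor))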
For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101
The registry responds with a full descriptor. Example response (key fields only):
{
  "id": "ranking-model-v5-20XX0101",
  "pipeline_run_id": "train-weekly-ranking-20XX0101",
  "owner_emails": ["alice@netflix.com"],
  "labels": [{"key": "team", "value": "personalization"}],
  ...
}
Raw events are heterogeneous and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.
Without normalization, downstream consumers would need to understand every source system’s schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.
Normalized MDS entity:
{
  "id": "aip://model/registry/ranking-model-v5-20XX0101",
  "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
  "entity_type": "ModelInstance",
  "owners": ["aip://user/identity/alice"],
  "tags": [{"tag": "team", "value": "personalization"}],
  ...
}
The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there’s still no reference to which A/B tests are using this model. The Model Registry doesn’t track experiments, and the Experimentation Platform doesn’t track which pipeline produced a given model. This is where knowledge enrichment becomes critical.
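A sketch of what such a normalization step could look like for the model descriptor above. The field mappings mirror the before/after JSON; the actual MDS code differs.
def normalize(descriptor: dict) -> dict:
    # Map a registry-specific descriptor to the unified MDS entity shape:
    # platform IDs become AIP URIs, owner_emails become owner URIs,
    # labels become tags, and foreign keys become entity references.
    return {
        "id": f"aip://model/registry/{descriptor['id']}",
        "entity_type": "ModelInstance",
        "pipeline_run": f"aip://pipeline-run/orchestrator/{descriptor['pipeline_run_id']}",
        "owners": [f"aip://user/identity/{email.split('@')[0]}"
                   for email in descriptor.get("owner_emails", [])],
        "tags": [{"tag": label["key"], "value": label["value"]}
                 for label in descriptor.get("labels", [])],
    }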
Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.
Datomic for Caching and Relationships
Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.
Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state.
What we store:
This enables:
In practice, we use Datomic for relationship-heavy, navigational queries such as:
These queries often span multiple hops in the graph and benefit from Datomic’s immutable fact model and efficient joins across entity relationships.
Elasticsearch for Discovery
Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.
What we index:
Index structure:
This enables:
Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real-time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that is acceptable for interactive use.
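For illustration, a Portal-style free-text query against such an index might look like this with the Elasticsearch Python client (8.x API assumed; the cluster URL, index name, and field list are ours, not MDS’s actual mappings):
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("https://search.example:9200")  # hypothetical cluster

def search_catalog(text: str) -> list:
    # Free-text entry point: match the query against names, descriptions,
    # and tags across every entity type, boosting exact name matches.
    resp = es.search(
        index="mds-entities",
        query={"multi_match": {"query": text,
                               "fields": ["name^3", "description", "tags.value"]}},
        size=20,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

# search_catalog("ranking model") -> normalized entities with their AIP URIs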
Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).
The enrichment workflow:
This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also enables retry logic and gradual enrichment as new entities become available.
Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it’s safe to rely on a particular relationship for debugging or impact analysis.
Why enrich? Source systems are purpose-built and don’t know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.
When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:
Step 1: Direct link to pipeline
The model references a pipeline_run_id. An enrichment job hydrates the pipeline and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101
Response:
{
  "run_id": "train-weekly-ranking-20XX0101",
  "pipeline": "weekly-ranking-trainer",
  "ab_test_cells": [
    {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
  ],
  ...
}
Step 2: Discover A/B test context
The enrichment job discovers the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345
{
  "test_id": "12345",
  "name": "Ranking Model v5 vs v4",
  "status": "ACTIVE",
  "cells": [
    {"cell_number": 1, "name": "control_ranking_v4"},
    {"cell_number": 2, "name": "treatment_ranking_v5"}
  ],
  ...
}
Step 3: Infer transitive relationships
The enrichment job now has the complete chain: model instance → pipeline run → A/B test cell → A/B test.
The job writes the inferred relationship back to Datomic, materializing these edges in the graph, and triggers re-indexing. MDS doesn’t just store what it’s told; it derives new knowledge by walking the graph in the background.
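Putting the three steps together, a background enrichment job could look roughly like this. The service URLs and the add_edge/reindex helpers are hypothetical placeholders; fetch_with_backoff is the hydration helper sketched earlier.
ORCHESTRATOR_URL = "https://orchestrator.example/api/v1"        # hypothetical
EXPERIMENTATION_URL = "https://experimentation.example/api/v1"  # hypothetical

def add_edge(source: str, relation: str, target: str) -> None: ...  # hypothetical Datomic write
def reindex(uri: str) -> None: ...  # hypothetical Elasticsearch re-index trigger

def enrich_model_instance(model_uri: str, pipeline_run_id: str) -> None:
    # Step 1: hydrate the pipeline run to find its A/B test cells.
    run = fetch_with_backoff(f"{ORCHESTRATOR_URL}/pipeline-runs/{pipeline_run_id}")
    for cell in run.get("ab_test_cells", []):
        # Step 2: hydrate the test from the Experimentation Platform.
        test = fetch_with_backoff(f"{EXPERIMENTATION_URL}/tests/{cell['test_id']}")
        # Step 3: materialize the inferred model -> A/B test edge.
        add_edge(source=model_uri,
                 relation="associated_ab_test",
                 target=f"aip://ab-test/experimentation/{test['test_id']}")
    reindex(model_uri)  # re-index so the new edges are searchable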
Why this matters: Without MDS, answering “Which A/B tests are using this model?” requires:
With the model lifecycle graph, it’s a single query:
query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline {
        name
        owners { name }
      }
      features {
        edges {
          node {
            name
            data { edges { node { name } } }
          }
        }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}
The reverse query also works: “What models are being tested in experiment 12345?”
With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn’t just about finding a model; it’s about traversing relationships:
For example, imagine an engineer investigating a degraded engagement metric for a personalization model. They might:
Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experiment platform). Now it’s one continuous journey in a single interface.
This graph-based exploration answers questions that were previously impossible to answer:
Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.
The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.
A typical interaction in the AIP Portal looks like:
When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.
Building the ML lifecycle graph is an ongoing journey. Significant challenges remain, and they represent our future opportunities:
This work represents the collective effort of stunning colleagues across the AI Platform organization: Emma Carney, Megan Ren, Nadeem Ahmad, Pat Olenik, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu