By Dao Mi, Pablo Delgado, Ryan Berti, Amanuel Kahsay, Obi-Ike Nwoke, Christopher Thrailkill, and Patricio Garza
At Netflix, data engineering has always been critical to the business’s ability to understand content, power recommendations, and drive decisions. Traditionally, the function centered on building robust tables and pipelines to capture facts, derive metrics, and provide well-modeled data products to our partners in analytics and data science. But as Netflix’s studio and content production have scaled, so too have the challenges (and opportunities) of working with complex media data.
Today, we’re excited to share how our team is formalizing a new specialization of data engineering at Netflix: Media ML Data Engineering. This evolution is embodied in our latest collaboration with our platform teams, the Media Data Lake, which is designed to harness the full potential of media assets (video, audio, subtitles, scripts, and more) and enable the latest advances in machine learning, including the latest transformer model architectures. As part of this initiative, we’re intentionally applying data engineering best practices, ensuring that our approach is both innovative and grounded in proven methodologies.
Traditional data engineering at Netflix focused on building structured tables for metrics, dashboards, and data science models. These tables consisted primarily of structured text or numerical fields, ideal for business intelligence, analytics, and statistical modeling.
However, the nature of media data is fundamentally different:
As our studio operations (see below) expanded, we saw the need for a new approach — one that could provide centralized, standardized, and scalable access to all types of media assets and their metadata for both analytical and machine learning workflows.
Enter Media ML Data Engineering — a new specialization at Netflix that bridges the gap between traditional data engineering and the unique demands of media-centric machine learning. This role sits at the intersection of data engineering, ML infrastructure, and media production. Our mission is to provide seamless access to media assets and derived data (including outputs from machine learning models) for researchers, data scientists, and other downstream data consumers.
This new role is essential for bridging the gap between creative media workflows and the technical demands of cutting-edge ML.
To enable the next generation of media analytics and machine learning, we are building the Media Data Lake: a data lake designed specifically for Netflix’s media assets and built on LanceDB. We have partnered with our data platform team to integrate LanceDB into our Big Data Platform.
Our initial focus this past year has been on delivering a “data pond”: a mini-version of the Media Data Lake targeted at video and audio datasets for early-stage model training, evaluation, and research. All data for this phase comes from AMP, our internal asset management system and annotation store. The scope is intentionally small so that we could build a solid, extensible foundation while introducing a new technology into the company. Lightweight queries to AMP let us explore the raw media assets and build an intuitive understanding of the media.
One of the most exciting developments is the rise of media tables — structured datasets that not only capture traditional metadata, but also include the outputs of advanced ML models.
These media tables power a range of innovative applications, such as:
These tables, built on top of LanceDB, are designed to scale, support complex queries, and serve research as well as other data science and analytical needs.
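To make the idea of a media table more concrete, here is a minimal sketch in plain Python of a row that pairs traditional metadata with a model output, plus a query that filters on a metadata column and then ranks by embedding similarity. Everything in it is an assumption made for illustration: the `MediaRow` fields, the sample rows, and the `search` helper are hypothetical, and the code deliberately does not use LanceDB's API or reflect the actual schema of the Media Data Lake.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class MediaRow:
    """One row of a hypothetical media table: metadata plus a model output."""
    asset_id: str                  # hypothetical identifier of the source asset
    media_type: str                # e.g. "video" or "audio"
    start_s: float                 # clip start within the asset, in seconds
    end_s: float                   # clip end within the asset, in seconds
    embedding: Tuple[float, ...]   # stand-in for an ML model's output vector


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(rows: Sequence[MediaRow], query: Sequence[float],
           media_type: str, k: int = 2) -> List[MediaRow]:
    """Filter on a metadata column, then return the k most similar rows."""
    candidates = [r for r in rows if r.media_type == media_type]
    ranked = sorted(candidates,
                    key=lambda r: cosine_similarity(r.embedding, query),
                    reverse=True)
    return ranked[:k]


# Made-up sample rows; the values carry no real meaning.
ROWS = [
    MediaRow("asset-a", "video", 0.0, 4.0, (1.0, 0.0, 0.0)),
    MediaRow("asset-a", "video", 4.0, 9.5, (0.6, 0.8, 0.0)),
    MediaRow("asset-b", "audio", 0.0, 30.0, (1.0, 0.0, 0.0)),
    MediaRow("asset-c", "video", 12.0, 15.0, (0.0, 0.0, 1.0)),
]

if __name__ == "__main__":
    for row in search(ROWS, query=(1.0, 0.0, 0.0), media_type="video"):
        print(row.asset_id, row.start_s, row.end_s)
```

The brute-force scan here only illustrates the shape of such a query. A system built for scale would typically rely on the storage engine's own filtering and vector indexing rather than a Python loop.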
Media ML Data Engineering is a team sport. Our data engineers partner with domain experts, data scientists, ML researchers, upstream business ops, and content engineering teams to ensure our data solutions are fit for purpose. We also work closely with our friendly platform teams so that technological breakthroughs with value beyond our small corner of the universe can become horizontal abstractions that benefit the rest of Netflix. This collaborative model enables rapid iteration, high data quality, innovative use cases, and technology reuse.
The evolution from traditional data engineering to Media ML Data Engineering, anchored by our Media Data Lake, is unlocking new frontiers for Netflix:
As we continue to grow the Media Data Lake, be on the lookout for subsequent blog posts sharing our learnings and tools with the broader media ML and data engineering community.
From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix was originally published in Netflix TechBlog on Medium.