Avneesh Saluja, Santiago Castro, Bowei Yan, Ashish Rastogi
Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.
Many of these media-related tasks require sophisticated long-form video understanding, e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Previous work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene starts. Can we use our collection of shows and movies to learn how to (a) fuse modalities like audio, video, and subtitle text together and (b) develop robust representations that leverage the narrative structure present in long-form entertainment? Consisting of tens of millions of individual shots across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation for training multimodal media understanding models that enable many capabilities across the company, such as ads relevance, clip popularity prediction, and clip tagging.
For these reasons, we developed the Netflix Media Foundational Model (MediaFM), our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder that generates rich, contextual embeddings¹ for shots from our catalog by integrating visual, audio, and textual information and learning the temporal relationships between shots. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the critical backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like art and trailers), and internal content analysis tools.
The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as “title”) using a shot boundary detection algorithm. For each shot, we generate three distinct embeddings from its core modalities: video, audio, and text.
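As a toy illustration of shot segmentation, the sketch below flags a boundary wherever consecutive frames differ sharply. This is not the detector used in production (the post does not specify one); the threshold, frame representation, and function name are illustrative assumptions.

```python
import numpy as np

def detect_shot_boundaries(frames, threshold=0.3):
    """Toy shot-boundary detector: flag a new shot at any frame whose mean
    absolute pixel difference from the previous frame exceeds a threshold.
    frames: array of shape (num_frames, height, width)."""
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    # The first frame always starts a shot; boundaries follow large jumps.
    return [0] + [i + 1 for i, d in enumerate(diffs) if d > threshold]

# Two synthetic "shots": five dark frames followed by five bright frames.
frames = np.concatenate([np.zeros((5, 4, 4)), np.ones((5, 4, 4))])
print(detect_shot_boundaries(frames))  # [0, 5]
```

Production detectors are far more robust (handling fades, cuts within motion, etc.), but the interface is the same: a title in, a list of shot start indices out.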
For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally-ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata, which is used to provide global context for each sequence (via the [GLOBAL] token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.
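The fusion step is a concatenate-and-normalize operation. A minimal sketch, assuming (for illustration only; the post states only the fused size) that each modality contributes a 768-dimensional embedding so the concatenation matches the 2304-dimensional fused vector:

```python
import numpy as np

def fuse_shot_embeddings(video_emb, audio_emb, text_emb):
    """Concatenate the three per-modality shot embeddings and unit-norm
    the result, yielding the fused shot vector."""
    fused = np.concatenate([video_emb, audio_emb, text_emb])
    return fused / np.linalg.norm(fused)

# Hypothetical shot: assume 768 dims per modality, so the fused vector
# is 3 * 768 = 2304-dimensional as described in the post.
rng = np.random.default_rng(0)
video, audio, text = (rng.standard_normal(768) for _ in range(3))
fused = fuse_shot_embeddings(video, audio, text)
print(fused.shape)  # (2304,)
```

Unit-norming puts every fused vector on the same sphere, so downstream cosine-based objectives and retrieval behave consistently regardless of per-modality scale.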
The core of our model is a transformer encoder, architecturally similar to BERT. A sequence of preprocessed shot embeddings is passed through the following stages:
We train the model using a Masked Shot Modeling (MSM) objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the cosine distance between its predicted embedding and the ground-truth embedding for each masked shot.
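A minimal NumPy sketch of the MSM corruption and loss. The model itself is mocked out, and the learnable [MASK] embedding is a fixed vector here; only the 20% masking ratio and the cosine-distance objective come from the description above.

```python
import numpy as np

rng = np.random.default_rng(42)

def mask_shots(seq, mask_token, mask_ratio=0.2):
    """Replace a random 20% of shot embeddings with the [MASK] embedding.
    Returns the corrupted sequence and the boolean mask of positions the
    model must reconstruct. (The real [MASK] is learnable; here it is fixed.)"""
    n = seq.shape[0]
    n_mask = max(1, int(round(mask_ratio * n)))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_mask, replace=False)] = True
    corrupted = seq.copy()
    corrupted[mask] = mask_token
    return corrupted, mask

def cosine_distance_loss(pred, target):
    """Mean cosine distance between predicted and ground-truth fused
    embeddings at the masked positions."""
    pred_n = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    tgt_n = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred_n * tgt_n, axis=-1)))

# Toy sequence of 10 fused shot embeddings.
seq = rng.standard_normal((10, 2304))
mask_token = rng.standard_normal(2304)  # stand-in for the learnable [MASK]
corrupted, mask = mask_shots(seq, mask_token)
# A perfect model that recovers the originals drives the loss to ~0.
print(cosine_distance_loss(seq[mask], seq[mask]) < 1e-9)  # True
```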
We optimized the hidden parameters with Muon and the remaining parameters with AdamW; the switch to Muon resulted in noticeable improvements.
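In the usual Muon setup, the 2-D hidden weight matrices go to Muon while embeddings, biases, norms, and the output head stay on AdamW. The grouping below is a hedged sketch with illustrative parameter names and a simplified rule, since the post does not detail the exact partition:

```python
def split_param_groups(named_shapes):
    """Partition parameters: 2-D hidden weight matrices go to Muon;
    embeddings, biases, norms, and the output head go to AdamW.
    (Parameter names and the exact rule are illustrative assumptions.)"""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_hidden_matrix = (
            len(shape) == 2 and "embed" not in name and "head" not in name
        )
        (muon if is_hidden_matrix else adamw).append(name)
    return muon, adamw

# Hypothetical parameter shapes for a small encoder.
params = {
    "token_embed.weight": (512, 1024),             # embedding table -> AdamW
    "encoder.layer0.attn.q.weight": (1024, 1024),  # hidden matrix  -> Muon
    "encoder.layer0.attn.q.bias": (1024,),         # bias           -> AdamW
    "encoder.layer0.mlp.fc1.weight": (4096, 1024), # hidden matrix  -> Muon
    "encoder.layer0.norm.weight": (1024,),         # norm           -> AdamW
    "head.weight": (2304, 1024),                   # output head    -> AdamW
}
muon, adamw = split_param_groups(params)
print(muon)  # ['encoder.layer0.attn.q.weight', 'encoder.layer0.mlp.fc1.weight']
```

The split matters because Muon's orthogonalized updates are defined for matrices; vector-shaped parameters are better served by AdamW.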
To evaluate the learned embeddings, we train task-specific linear layers on top of frozen representations (i.e., linear probes). Most of the tasks are clip-level: each example is a short clip, ranging from a few seconds to a minute, of the kind often presented to members when recommending a title. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), performs substantially better than embedding only the shots from the clip.
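A linear probe is simply a linear layer fit on frozen embeddings, with the backbone never updated. A self-contained sketch using a closed-form ridge classifier on synthetic stand-in embeddings (the probe beyond "linear layer" and all data here are illustrative assumptions):

```python
import numpy as np

def fit_linear_probe(X, y, l2=1e-3):
    """Closed-form ridge classifier over frozen embeddings.
    X: (n, d) embedding matrix, y: (n,) integer class labels.
    Returns a (d, n_classes) weight matrix."""
    n_classes = int(y.max()) + 1
    Y = np.eye(n_classes)[y]  # one-hot targets
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)

def predict(W, X):
    """Class = argmax over the linear layer's outputs."""
    return (X @ W).argmax(axis=-1)

# Toy stand-in for frozen MediaFM embeddings: two separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 16)), rng.normal(2, 0.5, (50, 16))])
y = np.array([0] * 50 + [1] * 50)
W = fit_linear_probe(X, y)
print((predict(W, X) == y).mean())  # 1.0 on this easy synthetic data
```

Because only the linear layer is trained, probe accuracy directly measures how much task-relevant information the frozen embeddings already contain.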
Our embeddings are foundational and we find that they bring value to applications across Netflix. Here are a few:
It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs serve as inputs that the relevant teams use when making decisions, rather than being consumed in a completely end-to-end fashion. Many of the improvements are also in various stages of deployment.
Figure 2³ compares MediaFM to several strong baselines:
On all tasks, MediaFM outperforms the baselines. Improvements are larger for tasks that require more detailed narrative understanding, e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.
MediaFM’s primary improvements over previous Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. This baseline concatenates the three input embeddings, essentially providing the same complete, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.
Additional modalities help somewhat for tone, but the main improvement comes from contextualization.
Oddly, adding multiple uncontextualized modalities hurts the clip popularity ranking model, while adding contextualization significantly improves performance.
For clip retrieval we see a steady progression, with each step (additional modalities, then contextualization) adding around 15%.
MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.
Next in this series of blog posts, we will present our method to embed title-level metadata and adapt it to our needs. Stay tuned!
We would like to thank Matt Thanabalan and Chaitanya Ekanadham for their contributions to this work.
MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.