By Zhuoning Yuan, Ta-Ying Cheng, Benjamin Klein, Bahareh Azarnoush
At Netflix, we build technology to help storytellers bring their creative visions to life and to help members discover the stories they love.
To connect stories with diverse audiences around the world, we produce promotional assets, including trailers, teasers, and social short‑form videos, that build on and elevate the original footage. Through close collaboration with the teams crafting these assets, we identified a recurring gap in current tools. Transforming raw footage into a polished final asset often requires complex edits like seamlessly adding new visual elements, patching or replacing backgrounds, or removing unwanted objects without breaking the scene’s physical continuity. These tasks typically demand hours of specialized manual editing work. While recent generative video editing models show promise, they often struggle to preserve the integrity of the source footage. Many methods regenerate every pixel to make an edit, which can fail to isolate changes and inadvertently alter elements that should remain untouched. To execute these tasks effectively, artists need tools that empower them to dictate exactly what changes and how it changes.
Our research goal is to make this process easier for artists. We’re deliberate about where and how AI is applied, ensuring that the technology always serves the creative intent. That principle drives our recent work: exploring the benefits of generative AI in ways that protect and expand creative choice, and keeping artists in precise control of their final vision. Recent advancements in AI video editing have demonstrated impressive capabilities in streamlining complex manual editing workflows, but key challenges remain before they can reliably support professional use:
Today, we’re sharing two research explorations that aim to address these challenges. We believe this work can help advance the field in a way that’s both meaningful and responsible:
Along with this blog post, we’re also publicly releasing the research papers that detail the algorithmic innovations behind Vera and VOID. We hope these publications will enable other researchers to experiment with these ideas, build upon our findings, and further advance the field.
Existing video editing models regenerate the entire clip, coupling the intended edit with regions that should remain unchanged. This increases the risk of altering details of the original footage. To tackle this challenge, we introduce Vera, a novel layered video diffusion framework for content-preserving video editing.
Given a source video and a text editing instruction, Vera jointly generates an edit layer and an alpha matte. These layers are then seamlessly composited with the original footage to produce the final edited result. By design, Vera supports complex tasks such as object addition and background change while ensuring that source-video pixels outside the edited regions remain perfectly intact.
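To make the compositing step concrete, here is a minimal sketch of how an edit layer and an alpha matte could be blended with a source frame. It assumes the standard alpha "over" blend (output = alpha × edit + (1 − alpha) × source); the exact compositing used in Vera is described in the paper and may differ. The function name `composite_frame` and the nested-list frame layout are hypothetical, chosen only for illustration.

```python
from typing import List

Pixel = List[float]        # [R, G, B], each channel in [0, 1]
Frame = List[List[Pixel]]  # rows of pixels
Matte = List[List[float]]  # per-pixel alpha in [0, 1]


def composite_frame(source: Frame, edit: Frame, alpha: Matte) -> Frame:
    """Blend an edit layer over a source frame (hypothetical helper).

    Assumes the standard "over" blend:
        out = alpha * edit + (1 - alpha) * source
    Where alpha is 0, the output equals the source pixel, which is how
    a layered approach can leave unedited regions untouched.
    """
    out: Frame = []
    for src_row, edit_row, a_row in zip(source, edit, alpha):
        out_row = []
        for src_px, edit_px, a in zip(src_row, edit_row, a_row):
            out_row.append(
                [a * e + (1.0 - a) * s for s, e in zip(src_px, edit_px)]
            )
        out.append(out_row)
    return out


# A 1x2 toy frame: the first pixel is outside the edit (alpha = 0),
# the second is half covered by the edit layer (alpha = 0.5).
source = [[[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]]]
edit = [[[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
alpha = [[0.0, 0.5]]

result = composite_frame(source, edit, alpha)
```

In this toy example the first output pixel is identical to the source pixel, while the second is an even mix of source and edit. Content preservation outside the edited region therefore follows from the matte alone, without the model having to regenerate those pixels.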
One of the main challenges in developing Vera was the lack of suitable training data. Since no public dataset provides the high-quality layered data we need (clean input, alpha matte, edit layer, composite video), we built our own. Using a combination of existing open-source videos and human annotation, we constructed a layered video dataset with a total of 486k frames at 832×480 resolution. We organized it into three subsets of increasing complexity:
Beyond data, model design is another key challenge. The three target outputs Vera generates — an edit layer (decoupled creative edits), an alpha matte layer (a grayscale mask that depends on the edit content and scene interactions such as occlusions), and a composite layer (natural footage) — have substantially different distributions. In practice, using a single shared architecture to reconcile these differences proved data-inefficient. To address this, Vera uses a MoT (Mixture-of-Transformers) design. Instead of a single DiT, we use three separate DiTs, one for each output:
To evaluate Vera, we curated a benchmark of test video-prompt pairs: 72 for object addition and 69 for background change, using open-source videos. The test set spans a range of difficulty, including slow and fast motions, various camera motions, single and multiple objects, and both simple and complex scenes. We evaluated the performance across three complementary dimensions:
In our results, both Vera-1.3B and Vera-14B significantly outperform existing baselines on content preservation while remaining comparable to the strongest baselines on video quality and instruction compliance (please see the paper for full results).
To complement automated metrics, we ran a human preference study comparing Vera against five baselines. We collaborated with 19 creative reviewers who evaluated 512 video trials in total. In each trial, reviewers were shown randomized side-by-side comparisons between the Vera model and a baseline model. The human consensus strongly aligned with our quantitative findings: Vera-1.3B was preferred over all baselines for content preservation and instruction compliance. Furthermore, reviewers rated Vera’s video quality as comparable to baselines on background change tasks, and noted a clear advantage for Vera on object addition tasks.
Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions — such as collisions with other objects — current models fail to correct them and produce implausible results. To address this, we present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios.
Given an input video, the user clicks on an object to remove. A VLM-based reasoning pipeline then analyzes the scene to identify other regions that will be causally affected, e.g., objects that will fall, collide, or change trajectory. This physical reasoning is encoded into a quadmask to guide the diffusion model:
We built on top of the Kubric simulation engine and the HUMOTO human motion capture dataset to generate synthetic counterfactual video pairs along with their corresponding quadmasks. Specifically, the counterfactual videos are generated by re-simulating the exact scene from the original video, but with the target object(s) or human removed. This resimulation creates an alternate outcome based on strict laws of physics. For example, if a person holding a lamp is removed from the scene, the simulation ensures the lamp obeys gravity and falls to the ground. The quadmasks then capture the removed object (black), the affected regions (grey), their overlaps (dark grey), and the unchanged parts of the scene (white).
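The four-region encoding described above can be sketched as follows. This is an illustrative sketch only: the grey levels (0, 63, 127, 255) and the function name `build_quadmask` are assumptions made for the example, not values or APIs taken from the VOID paper. The only thing drawn from the text is the mapping of regions to black, dark grey, grey, and white.

```python
from typing import List

Mask = List[List[bool]]

# Hypothetical grey levels; the source only names the colors.
REMOVED = 0      # black: the object to remove
OVERLAP = 63     # dark grey: removed object overlapping an affected region
AFFECTED = 127   # grey: regions causally affected by the removal
UNCHANGED = 255  # white: everything else


def build_quadmask(removed: Mask, affected: Mask) -> List[List[int]]:
    """Combine two binary masks into one four-valued mask (hypothetical)."""
    quad = []
    for removed_row, affected_row in zip(removed, affected):
        row = []
        for is_removed, is_affected in zip(removed_row, affected_row):
            if is_removed and is_affected:
                row.append(OVERLAP)
            elif is_removed:
                row.append(REMOVED)
            elif is_affected:
                row.append(AFFECTED)
            else:
                row.append(UNCHANGED)
        quad.append(row)
    return quad


# A 1x4 toy frame covering all four cases, left to right.
removed = [[True, True, False, False]]
affected = [[False, True, True, False]]

quadmask = build_quadmask(removed, affected)
# quadmask == [[0, 63, 127, 255]]
```

Encoding the overlap as its own value keeps the two source masks recoverable from a single channel, which is one plausible reason to use four levels rather than three.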
During model training for VOID, we introduce two improvements over prior work: (i) quadmask conditioning, which explicitly identifies regions in each frame that may change after the object is removed, and (ii) a second-pass video appearance refiner that reduces artifacts such as unwanted object morphing. VOID uses the CogVideoX-Fun-V1.5-5b-InP backbone with Gen-Omnimatte’s checkpoint and is fine-tuned for video inpainting with interaction-aware quadmask conditioning.
Experiments across both synthetic and real data demonstrate that VOID preserves consistent scene dynamics far better than prior video object removal methods (please see the paper for full results). VOID successfully maintains object structure and produces plausible motion over time across a wide variety of real-world cases. By contrast, results from both open- and closed-source baselines consistently exhibit physically inaccurate artifacts. For instance, baselines generate water splashes without human impact (see top row of the figure below) or show spinning tops being disrupted without the presence of interacting hands.
To complement our quantitative evaluation, we conducted a user study with 25 creative reviewers to measure the perceptual realism and physical plausibility of our counterfactual edits. Each participant was randomly assigned 5 out of 75 real-world scenarios, resulting in 125 total comparisons. For each video, participants viewed the original input alongside the outputs of VOID and six baselines (seven models total) in a randomized order. Participants were asked to select the video that best reflected how the scene should realistically appear after the object was removed, factoring in visual quality, temporal consistency, blending, the realism of scene evolution, and the absence of artifacts. VOID was selected 64.8% of the time, substantially outperforming all baseline models.
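As a quick sanity check on the study numbers above, the short calculation below reproduces the comparison count and puts the 64.8% figure in context. It assumes each comparison yields exactly one selection and that a participant choosing uniformly at random among the seven outputs would pick any given model 1/7 of the time. Both are simplifying assumptions for illustration, not statements from the paper.

```python
from fractions import Fraction

participants = 25
scenarios_per_participant = 5
models_per_comparison = 7          # VOID plus six baselines
void_share = Fraction(648, 1000)   # 64.8% reported selection rate

# 25 participants x 5 scenarios each = 125 comparisons.
total_comparisons = participants * scenarios_per_participant

# Assuming one selection per comparison, 64.8% of 125 is 81 picks.
void_selections = void_share * total_comparisons

# Assumed baseline: uniform random choice among seven outputs.
chance_share = Fraction(1, models_per_comparison)

# How many times more often VOID was chosen than under that baseline.
ratio_to_chance = void_share / chance_share

print(total_comparisons)            # 125
print(void_selections)              # 81
print(float(ratio_to_chance))       # 4.536
```

Under these assumptions, VOID was selected in 81 of 125 comparisons, roughly 4.5 times the rate expected from uniform random choice.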
Applying AI in ways that serve both member and creator needs is core to our research philosophy, and these projects reflect that approach. While Vera and VOID show promising early results, reaching production-ready quality will require addressing several limitations we encountered. For example, Vera struggles with some complex effects such as lightning or smoke due to the limited training data, and in some cases, it fails to keep background motion fully consistent with the input camera movement. Despite the various generalization capabilities VOID exhibits, we still observe domain gaps. For instance, it cannot handle videos with unusual camera angles or shots captured very close to the target object, and it currently has constraints on supported video length and resolution.
These limitations motivate continued investment in this line of research. Vera and VOID are important early efforts toward making complex video editing more controllable and accessible for artists. For this work, we used publicly available datasets with additional annotation efforts for experiments, and we hope that sharing our research will encourage the broader community to build on these ideas and advance them further.
Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.