Page: https://ernie-research.github.io/NAVA/ NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations. Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets a new SOTA on Sync-C, Sync-D, video quality, and audio WER while using 2× to 5× fewer parameters than open-source baselines. Submitted by /u/AgeNo5351
Editor’s Note: This blog post was written by Greg Little, Senior Counselor at Palantir, with…
By Oleksii Tkachuk, Kartik Sathyanarayanan, Rajiv Shringi. Introduction: Netflix has a diverse range of graph use cases, each…
Deploying large language models (LLMs) at scale on Amazon SageMaker AI Inference makes observability a…
Welcome to the second Cloud CISO Perspectives for May 2026. Today, Usman Chaudhary, Field CISO,…
Dads are traditionally tough to shop for—let me help with these handpicked gift ideas for…
Artificial intelligence chatbots need to work on their social judgment, recent events suggest. At one…