Page: https://ernie-research.github.io/NAVA/ NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations. Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused via cross-attention. On Verse-Bench it sets a new state of the art on Sync-C, Sync-D, video quality, and audio WER while using 2× to 5× fewer parameters than open-source baselines. submitted by /u/AgeNo5351
Monitoring and troubleshooting generative AI inference endpoints operating at scale is challenging. When your large…
A year ago, Simon Willison wrote one of the cleanest definitions of an agent that…
The UK’s 5-million-plus small and midsize businesses and enterprises (SMBs) are the backbone of our…
Today, we’re announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now…
The United Kingdom, and London in particular, continues to be one of the great hubs…
Days before Anthropic took its most advanced AI models offline, the White House ordered the…