NAVA: a 6.3B audio-video model

Page: https://ernie-research.github.io/NAVA/
Model: https://huggingface.co/ernie-research/NAVA
Github: https://github.com/ernie-research/NAVA

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt. It supports multi-speaker speech with reference-timbre control and image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT: a dedicated alignment space first establishes audio-video correspondence, and context (text, speaker embeddings) is then fused in via cross-attention. On Verse-Bench it reports new state-of-the-art results on Sync-C, Sync-D, video quality, and audio WER while using 2× to 5× fewer parameters than open-source baselines.
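
To make the "align first, fuse second" ordering concrete, here is a toy sketch in plain Python. It is not NAVA's code. It assumes the alignment step can be approximated by joint self-attention over concatenated video and audio tokens, and it omits all learned projections, normalization, and multi-head structure. The function names (`attention`, `align_then_fuse`) are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention on lists of vectors (no learned weights).
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out

def residual_add(x, y):
    # Element-wise sum of two equally shaped token lists.
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]

def align_then_fuse(video, audio, context):
    # Stage 1 (align, ASSUMED form): video and audio tokens attend to each
    # other in one shared sequence, with no context involved yet.
    joint = video + audio
    aligned = residual_add(joint, attention(joint, joint, joint))
    # Stage 2 (fuse): the aligned tokens query the context tokens
    # (e.g. text or speaker embeddings) via cross-attention.
    fused = residual_add(aligned, attention(aligned, context, context))
    return fused[:len(video)], fused[len(video):]

if __name__ == "__main__":
    dim = 4
    video = [[0.1 * i + 0.01 * j for j in range(dim)] for i in range(3)]
    audio = [[0.2 * i - 0.01 * j for j in range(dim)] for i in range(2)]
    context = [[0.05 * (i + j) for j in range(dim)] for i in range(4)]
    v_out, a_out = align_then_fuse(video, audio, context)
    print(len(v_out), len(a_out), len(v_out[0]))
```

The point of the ordering is that context only enters after the two modalities have exchanged information, so the cross-attention step conditions a representation that is already audio-video aware.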

submitted by /u/AgeNo5351