NAVA – A 6.3B audio-video model

Page: https://ernie-research.github.io/NAVA/
Model: https://huggingface.co/ernie-research/NAVA
Github: https://github.com/ernie-research/NAVA

NAVA is a 6.3B-parameter joint audio-video generator that synthesizes synchronized video and audio from a single prompt, including multi-speaker speech with reference-timbre control and image-conditioned continuations.

Instead of post-hoc-aligned dual towers or fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT (multimodal diffusion transformer): a dedicated alignment space first establishes audio-video correspondence, then context (text, speaker embeddings) is fused in via cross-attention. On Verse-Bench it sets a new state of the art on Sync-C, Sync-D, video quality, and audio WER (word error rate) while using 2× to 5× fewer parameters than open-source baselines.
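
To make the align-then-fuse ordering concrete, here is a toy sketch in plain Python. It is not NAVA's code: the function names, token counts, and dimensions are hypothetical, there are no learned projections, and modelling the "alignment space" as joint self-attention over concatenated video and audio tokens is an assumption on my part. It only illustrates the order of operations described above: align the two modalities first, then fuse context through cross-attention.

```python
import math
import random

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Single-head scaled dot-product attention over lists of vectors.
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def residual_add(xs, ys):
    # Element-wise sum of two equally shaped token lists.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def align_then_fuse_block(video, audio, context):
    # Stage 1 (align, assumed form): video and audio tokens attend jointly
    # over one concatenated sequence, so each modality sees the other.
    joint = video + audio
    joint = residual_add(joint, attend(joint, joint, joint))
    # Stage 2 (fuse): the aligned tokens query the context tokens
    # (text / speaker embeddings) via cross-attention.
    joint = residual_add(joint, attend(joint, context, context))
    return joint[:len(video)], joint[len(video):]

# Hypothetical sizes: 4 video tokens, 6 audio tokens, 3 context tokens, dim 8.
rng = random.Random(0)

def make_tokens(n, d=8):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

video, audio, context = make_tokens(4), make_tokens(6), make_tokens(3)
v_out, a_out = align_then_fuse_block(video, audio, context)
print(len(v_out), len(a_out), len(v_out[0]))  # 4 6 8
```

The point of the ordering is that context never touches the audio or video tokens until they have already exchanged information with each other; a real MMDiT block would add learned projections, multiple heads, normalization, and timestep conditioning on top of this.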

submitted by /u/AgeNo5351
