Model Drop | ZIT + LTX 2.3 + Music Video | Arca Gidan contest

The idea came from something I’m pretty sure most of us live every single day: you wake up, check your phone, and another model has dropped. Open source, closed source, whatever source: faster, smarter, more creative, more powerful. And before you’ve even had coffee, you’re already reworking a ComfyUI workflow that was perfectly fine yesterday. That loop of FOMO is what this song is about. Maybe some of you can relate to that feeling.

I wrote the lyrics first, then used Suno AI to turn them into a track. That became the creative baseline.

Shot List

With the song done, I went through it verse by verse — every chorus, every pre-chorus, every bridge — and for each section I came up with 3 to 5 possible shots. Where is our main character? What’s the camera angle? What’s the situation? What does this line actually look like as an image? That process gives you a kind of ordered visual setlist that maps directly onto the song structure. You always know what you need and where it goes.
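If you like keeping this kind of list in a file rather than in your head, here is a minimal sketch of one way to structure it. The section names and shot descriptions are made up for illustration and are not the ones from the video.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    section: str    # song section this shot belongs to
    location: str   # where the main character is
    camera: str     # camera angle / framing
    situation: str  # what is happening in the frame


# Hypothetical entries, kept in song order.
SHOT_LIST = [
    Shot("verse 1", "bedroom at dawn", "close-up", "checking a phone in bed"),
    Shot("verse 1", "kitchen", "medium shot", "coffee going cold next to a laptop"),
    Shot("verse 1", "home office", "over-the-shoulder", "node graph on the monitor"),
    Shot("chorus", "server room", "low angle", "singing to camera"),
    Shot("chorus", "rooftop at night", "medium shot", "arms raised under neon signs"),
]


def shots_by_section(shots):
    """Group shots by song section, preserving song order."""
    grouped = {}
    for shot in shots:
        grouped.setdefault(shot.section, []).append(shot)
    return grouped


def sections_outside_range(shots, low=3, high=5):
    """Return sections that do not have between `low` and `high` shots."""
    grouped = shots_by_section(shots)
    return [name for name, items in grouped.items() if not low <= len(items) <= high]


print(sections_outside_range(SHOT_LIST))
```

The check at the end flags any section that still needs more ideas before the prompt list gets written.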

Character (No LoRA)

For the main character I used Z Image Turbo. No LoRA, no training — just consistent prompting. The turbo architecture works in our favour here: because it’s a more constrained model, keeping the character description locked across prompts produces surprisingly similar results, which creates the illusion of a consistent character across dozens of images. I kept the description identical every time and only changed the background, camera angle, and expression. Effective and fast.
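The locked-description idea can be sketched as a small prompt builder. This is only an illustration: the character text, the scene values, and the word order of the prompt are all assumptions, not the actual prompts used for the video.

```python
from itertools import product

# Hypothetical character description. The point is that it never changes.
CHARACTER = "a woman in her 30s with short silver hair, round glasses and a yellow raincoat"


def build_prompt(background: str, camera: str, expression: str) -> str:
    """Keep the character text fixed; vary only background, camera and expression."""
    return f"{camera} of {CHARACTER}, {expression}, {background}"


# Hypothetical variation lists.
backgrounds = ["in a dark bedroom lit by a phone screen", "in a neon-lit server room"]
cameras = ["close-up", "medium shot"]
expressions = ["tired expression", "excited expression"]

prompts = [build_prompt(b, c, e) for b, c, e in product(backgrounds, cameras, expressions)]

for prompt in prompts:
    print(prompt)
```

Every generated prompt contains the exact same character string, so the only things the model sees changing are the three slots around it.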

Image Generation

Once the shot list was complete I had a massive prompt list covering every scene. I ran all of them through ComfyUI overnight — or longer, depending on the count. Two categories of images: B-roll shots from the setlist, and medium-to-close-up shots specifically for the lip-sync sections.
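For the overnight batch, one option is to queue the prompts through ComfyUI's HTTP endpoint instead of clicking through them. The sketch below rests on several assumptions: the workflow was exported in ComfyUI's API format, the server runs at the default local address, and the positive prompt lives in a node with id "6" and an input named "text". The node id and input name depend entirely on your workflow, so check the exported JSON first.

```python
import copy
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188/prompt"  # assumed default local address
PROMPT_NODE_ID = "6"  # hypothetical: id of the positive-prompt node in your export


def build_payload(workflow: dict, prompt_text: str, node_id: str = PROMPT_NODE_ID) -> bytes:
    """Return a JSON request body with the prompt text swapped into a copy of the workflow."""
    wf = copy.deepcopy(workflow)  # leave the loaded template untouched
    wf[node_id]["inputs"]["text"] = prompt_text
    return json.dumps({"prompt": wf}).encode("utf-8")


def queue_prompt(payload: bytes, url: str = COMFY_URL) -> None:
    """POST one job to the ComfyUI queue (requires a running server)."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request).close()


def queue_all(workflow: dict, prompt_texts: list[str]) -> None:
    """Queue every prompt in order; ComfyUI works through them one by one."""
    for text in prompt_texts:
        queue_prompt(build_payload(workflow, text))
```

To use it, load the exported workflow with `json.load`, pass it to `queue_all` together with the prompt list, and let the queue run.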

The ZIT workflow I used comes from another Reddit post on r/comfyui: "RED Z-Image-Turbo + SeedVR2 = Extremely High Quality Image Mimic Recreation. Great for Avoiding Copyright Issues and Stunning image Generation." I used the standard ZIT model, not the RED version, and I skipped the Mimic part of the workflow.

Image to Video

All the generated stills went into LTX img2video inside ComfyUI to bring them to life. For the lip-sync sections I used LTX I2V synced to the audio track. Since LTX caps out at 20 seconds per render, everything gets generated in chunks and stitched together in post.
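The chunking step is simple arithmetic, sketched below. The 20-second limit is the one mentioned above; the song length in the example is made up, and cutting at fixed intervals is an assumption (in practice you may prefer to cut at line or section boundaries).

```python
def split_into_chunks(total_seconds: float, max_chunk: float = 20.0) -> list[tuple[float, float]]:
    """Split a track into consecutive (start, end) windows of at most max_chunk seconds."""
    if total_seconds <= 0 or max_chunk <= 0:
        raise ValueError("total_seconds and max_chunk must be positive")
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk, total_seconds)
        chunks.append((start, end))
        start = end
    return chunks


# Hypothetical 65-second lip-sync section.
for start, end in split_into_chunks(65.0):
    print(f"render audio from {start:.1f}s to {end:.1f}s")
```

Each window then becomes one render, with the matching slice of audio, and the clips are joined back in order during the edit.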

The close-up rule matters: the further the camera is from the character, the worse LTX renders the lip sync. Medium shot is the minimum — anything wider and quality degrades fast.

The workflow I mainly used comes from this r/StableDiffusion post: "PSA: Use the official LTX 2.3 workflow, not the ComfyUI included one. It’s significantly better."

Final Edit

No Premiere Pro, no DaVinci — just InShot on my phone. I build the full lip-sync timeline first so it covers the whole song, then layer the B-roll clips over the top to fill the gaps and add visual depth.

That’s the whole pipeline: idea → lyrics → song → shot list → character → images → animation → edit. The video is fully local, fully open source, built over a couple of nights on a 3090.

Hope you enjoy it.

Assets & Workflows

You can find the workflow files and a full written guide over on the Arca Gidan page if you want to dig into the details.

https://arcagidan.com/entry/d2cae0b9-3d38-4959-b1b5-36ea60f34438

Honestly, what a challenge to be part of. Seeing what everyone came up with — the concepts, the creativity, the sheer variety of approaches — was genuinely inspiring. This is exactly the kind of community that makes local AI worth pursuing. Really glad I got to be a part of it. 🙌

submitted by /u/Ok-Wolverine-5020