Zeev, CEO of LTX, here. Wanted to pull back the curtain on the technical bets we’re making and where they’re headed. Happy to go deep in the comments.

We’ve been heads down on the next generation of LTX, and I want to share what’s coming. Not the long-term vision post (that’s coming separately), just a concrete look at what we’re building right now and what you’ll see soon.

The next release of LTX-2 is focused on generation quality across the board. As usual, that means more data and more compute, and this time around two architectural flavors: a dense model and a mixture-of-experts (MoE) model, to accommodate different speed and quality trade-offs.

MoE is a fundamental architectural shift where the model activates only the parts it needs for a given generation. This lets us scale capability and quality without paying for it linearly in compute. It’s the kind of change that doesn’t show up in a single demo but fundamentally changes what the model can do at a given cost.

With both dense and MoE, we are going to ship a significantly more capable text encoder. The result is a model that better understands what you wrote, including complex, multi-shot prompts that older architectures tended to flatten or ignore.

We are also investing heavily in performance and memory: newer attention kernels and improved low-precision support mean the latest model runs well across a wider range of hardware.

Now, the part I think this community will really care about. We’re opening up more of the training infrastructure: new trainer recipes and LoRA training tooling so you can build domain-specific model variants on top of LTX, not just use the base weights as-is. Think specialized flavors for use cases like human motion, product visualization, and architectural environments, each fine-tuned from the same foundation but optimized for a specific domain.

On the enterprise side, this extends into a post-training customization layer that lets teams fine-tune on proprietary data without retraining from scratch. The full picture is three tiers: a base foundation model, domain-specific trainer configurations, and a customer customization layer on top.

To be clear: we’re committed to keeping the weights open. The base model, the derivatives, the tooling. This isn’t a bait-and-switch where we open-source early and close up once the model gets good enough to monetize. Openness is how we build, and the community building on top of our models will always reach further than any single team working alone.

One more thing we’re exploring, and we think it could be a real leap in output quality: a diffusion-based decoder that replaces the traditional VAE for converting latents back into pixels. The potential is sharper, higher-resolution output that combines decoding and upscaling into a single step. We’re actively experimenting with it in our latent space. This is the kind of architectural bet that could change the standard of video generation, and we hope open models will lead it.

We also know the model is only half the story. There’s still a real gap between “the model works” and “I can ship a finished product on this,” and closing it matters as much to us as any model improvement. We are overhauling our documentation and launching reference implementations to show exactly what good deployment looks like in practice.

More to come soon. In the meantime, tell us what you want us to prioritize.

Zeev

submitted by /u/ltx_model
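For readers less familiar with mixture-of-experts, the "activates only the parts it needs" idea in the post can be illustrated with a generic top-k routing sketch. This is not LTX's architecture. The function names, the toy scaling "experts", and the sizes are all hypothetical, chosen only to show why compute scales with the number of active experts and not with the total.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    # Hypothetical stand-in for an expert feed-forward block:
    # it just scales its input so the routing effect is easy to see.
    return lambda x: [scale * v for v in x]


def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token vector through only its top_k experts."""
    # Router: one score per expert (dot product of a router row with the token).
    logits = [sum(w * v for w, v in zip(row, token)) for row in router_weights]
    # Keep the indices of the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Gate values are a softmax over the selected experts only.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](token)  # only selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, top


random.seed(0)
dim, n_experts = 4, 8  # toy sizes, assumed for illustration
experts = [make_expert(i + 1) for i in range(n_experts)]
router_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, active = moe_forward(token, router_weights, experts, top_k=2)
print(f"ran {len(active)} of {n_experts} experts: {active}")
```

Here eight experts exist but only two run per token, so adding experts grows total parameters without growing per-token compute at the same rate. Production MoE layers add pieces this sketch leaves out, such as load balancing across experts and batched dispatch.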