Showcase

What Is LTX 2.3: The 22B Open-Source Audio-Video Model Explained

LTX 2.3 is a 22B parameter DiT-based audio-video foundation model from Lightricks. Here's what makes it different and why it matters for ComfyUI users.

By ltx workflow


Hi, I’m Dora. I was testing a prompt in CrePal last week when someone in a Discord server I follow dropped a link to a new model release: LTX 2.3. The thread blew up. “Free. Local. 22B. Audio + video in one pass.” I stopped what I was doing and went down the rabbit hole for the next two hours. Here’s everything I found.

What LTX 2.3 is (22B DiT, dev vs distilled variants)


LTX 2.3 is a 22-billion-parameter Diffusion Transformer (DiT) audio-video foundation model from Lightricks. It’s fully open-source under Apache 2.0, which means it’s free for most creators, including commercial use for organizations under $10M in annual revenue.

The big deal: it generates both video and audio together in a single pass. The sound, the motion, and the visuals all come from the same model at the same time. Most tools you’ve used separate those two steps. LTX 2.3 doesn’t.

The base model uses Google’s Gemma 3 12B as its text encoder — the part that reads your prompt and figures out what to generate.
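
To make that hand-off concrete, here’s a minimal sketch of how an encoder feeds a DiT. Every name and dimension below is a stand-in I made up for illustration (3840 matches Gemma 3 12B’s published hidden size, but nothing here is the actual LTX 2.3 API):

```python
import torch

# Stand-in for Gemma 3 12B's output: one hidden state per prompt token.
# 3840 is Gemma 3 12B's hidden size; the sequence length 77 is arbitrary.
text_hidden = torch.randn(1, 77, 3840)

# A "text connector" projects encoder states into the DiT's width so the
# transformer can cross-attend to them. Width 4096 is an invented example.
connector = torch.nn.Linear(3840, 4096)
conditioning = connector(text_hidden)

# The DiT denoises a single joint latent carrying BOTH video and audio,
# which is why sound and motion come out of the same pass.
joint_latent = torch.randn(1, 144, 8, 64, 64)  # (batch, channels, frames, h, w), invented shape
print(conditioning.shape, joint_latent.shape)
```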

Key Specs at a Glance

  • Parameters: 22B, Diffusion Transformer (DiT) architecture
  • Output: video and synchronized audio in a single diffusion pass
  • Text encoder: Google Gemma 3 12B
  • Weights: ~18–20GB (optimized FP8 package), ~40GB (full BF16)
  • Official requirements: 32GB VRAM, Python 3.12+, CUDA 12.7+
  • Portrait: native 9:16 output up to 1080×1920
  • License: Apache 2.0, free commercial use for organizations under $10M annual revenue

What’s New vs LTX 2 (upscaler, audio, monorepo, Desktop App)

LTX 2 established the architecture. LTX 2.3 is a point release, but the improvements are structural, not cosmetic.

Rebuilt VAE. The most consequential change is a rebuilt VAE with a redesigned latent space. In practice this means sharper fabric, cleaner hair, and stable chrome reflections during camera moves — the kinds of fine-detail tests that previous open-source models failed consistently.

4x larger text connector. Multi-subject prompts with specific spatial relationships — “a red car parked behind a white truck at night” — now hold across the full clip rather than drifting after 3–4 seconds. This was one of the most frustrating failure modes in earlier versions.

Cleaner audio. Audio quality is cleaner thanks to a new HiFi-GAN vocoder. Because LTX 2.3 produces audio within the same diffusion pass as the video, a door slam lands on exactly the right frame. It’s synchronized at the model level.
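
For a rough mental model of why the sync is frame-accurate, picture the decode stage splitting one joint latent into a video stream and an audio stream. The article only tells us video goes through the rebuilt VAE and audio through a HiFi-GAN-style vocoder; every module and shape below is an invented stand-in, not the real decoder:

```python
import torch

def decode_joint_latent(joint_latent: torch.Tensor):
    """Illustrative split of one joint latent into video frames and a waveform."""
    # Assumption: video and audio channels live side by side in the latent,
    # so every denoising step updates both in lockstep (frame-level sync).
    video_latent = joint_latent[:, :128]   # (1, 128, frames, h, w)
    audio_latent = joint_latent[:, 128:]   # (1, 16, frames, h, w)

    # Stand-in for the rebuilt VAE decoder: latent channels -> RGB frames.
    vae_decoder = torch.nn.Conv3d(128, 3, kernel_size=1)
    frames = vae_decoder(video_latent)

    # Stand-in for the HiFi-GAN vocoder: audio features -> 1-D waveform.
    vocoder = torch.nn.ConvTranspose1d(16, 1, kernel_size=4, stride=2)
    waveform = vocoder(audio_latent.flatten(start_dim=2))
    return frames, waveform

frames, waveform = decode_joint_latent(torch.randn(1, 144, 8, 32, 32))
print(frames.shape, waveform.shape)
```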

Native portrait mode. Instead of generating landscape video and cropping it later, LTX 2.3 can generate vertical video directly — especially useful for YouTube Shorts, Instagram Reels, and TikTok.

LTX Desktop App. The desktop app provides a free, local generation workspace with text-to-video, image-to-video, and timeline tools, no cloud account needed. It shipped alongside 2.3 and is probably the fastest way to get started without touching Python.

Monorepo + LoRA training. Lightricks provides reproducible LoRA and IC-LoRA training through the LTX-2 Trainer, with motion, style, and likeness training completing in under an hour in many configurations.
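
I haven’t run the trainer myself, so here’s only a sketch of what a minimal LoRA run might look like. Every field name below is a guess for illustration, not the LTX-2 Trainer’s real schema (check the repo for that); only “LoRA”, “IC-LoRA”, and the under-an-hour claim come from the announcement:

```python
# Hypothetical config for a quick style LoRA run. All keys are assumptions.
lora_run = {
    "base_model": "checkpoints/ltx-2.3",    # the 22B weights
    "method": "lora",                        # or "ic-lora"
    "rank": 32,                              # a typical LoRA rank; my pick
    "dataset_dir": "data/my_style_clips/",   # a handful of reference clips
    "max_steps": 2000,                       # small runs finish in under an hour
    "learning_rate": 1e-4,
}
print(lora_run)
```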

What You Can Generate

LTX 2.3 handles the full generation stack in one model (sketched in code right after this list):

  • Text-to-video (T2V): Describe a scene, get a clip with matched audio.
  • Image-to-video (I2V): Start from a still image and animate it. Note: I2V has known instability bugs in the current release — the model occasionally freezes or over-applies the Ken Burns effect.
  • Multi-stage / chained generation: Generate a base clip, upscale it, chain sequences together for longer narratives.
  • Portrait video: Native 9:16 output up to 1080×1920, no cropping required.

What you can’t do well yet: complex crowd physics, water simulation, and emotional tonal subtlety in faces still lag behind top closed systems like Veo 3.1. This isn’t a knock; it’s just honest.
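
Here’s what those four modes look like as calls, using a made-up generate() helper. The resolutions and mode names come from this article; nothing below is the real LTX 2.3 or ComfyUI API:

```python
def generate(prompt, mode="t2v", image=None, width=1280, height=720, seconds=5):
    """Hypothetical stand-in for a pipeline call; it just echoes the request."""
    return {"prompt": prompt, "mode": mode, "image": image,
            "size": (width, height), "seconds": seconds}

# Text-to-video: one prompt in, clip plus matched audio out.
clip = generate("a red car parked behind a white truck at night")

# Image-to-video: animate a still (mind the instability bugs noted above).
animated = generate("slow push-in on the subject", mode="i2v", image="still.png")

# Native portrait: ask for 9:16 directly instead of cropping landscape.
short = generate("street dancer in neon rain", width=1080, height=1920)

# Chained generation: reuse a frame from the base clip as the next start image.
sequel = generate("the dancer exits frame left", mode="i2v", image="last_frame.png")
```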

Free Access Paths (Online vs Local)

Local (truly free, unlimited):

The base optimized package in FP8 format weighs about 18–20GB. This includes the diffusion weights, the VAE, and the text encoders. The full uncompressed BF16 version for server hardware takes up around 40GB.

The official requirements ask for 32GB VRAM and Python 3.12+ with CUDA 12.7+. In practice, the community has pushed this lower. One user ran the Q4_K_S GGUF on an RTX 3080 (10GB VRAM) and got a 960×544, 5-second clip with audio in about 2–3 minutes — accepting roughly 5–8% softening in fine detail versus the BF16 baseline, but with identical motion coherence.
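
Those download sizes line up with simple parameter-count math. A back-of-the-envelope check, assuming all 22B parameters ship at the stated precision (real packages mix precisions across components, and ~4.5 bits per weight for Q4_K_S is my approximation, so treat these as ceilings):

```python
params = 22e9  # 22 billion parameters

# Bits per weight: BF16 and FP8 are exact; ~4.5 is a rough average
# for llama.cpp-style Q4_K_S quantization.
for name, bits in [("BF16", 16), ("FP8", 8), ("Q4_K_S", 4.5)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")

# BF16:   ~44 GB -> close to the ~40GB server package
# FP8:    ~22 GB -> same ballpark as the 18-20GB optimized package
# Q4_K_S: ~12 GB -> why a 10GB RTX 3080 is a squeeze, hence the slower runs
```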

Mac users: Apple Silicon M2/M3/M4 can use the Metal framework (MPS) for hardware acceleration. Expected generation time for a 10-second 1080p clip on an M3 Max is ~4–6 minutes. It works, but it’s slower than a CUDA GPU.
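
If you’re scripting runs yourself rather than using a GUI, the standard PyTorch device probe covers both the CUDA and Apple Silicon paths; these calls are all real, stable torch APIs:

```python
import torch

# Prefer CUDA on NVIDIA cards, fall back to Metal (MPS) on Apple Silicon,
# and finally to CPU if neither backend is present.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Generating on: {device}")
```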

LTX Desktop App (easiest entry point):

As covered above, the Desktop app gives you a free, local workspace with text-to-video, image-to-video, and timeline tools, with no cloud account or Python setup required.



#ltx-2.3 #showcase #overview #22b #comfyui