Research · January 1, 2025

LTX-Video Research: The Architecture Behind LTX 2.3

A deep dive into the LTX-Video architecture — transformer design, distillation approach, FP8 quantization, and what makes LTX 2.3 fast and high-quality.

By ltx workflow

Editor's Note: This article analyzes the research behind LTX-Video, the model family underlying LTX 2.3. We cover the key technical contributions including the transformer architecture, distillation approach, and what enables fast high-quality video generation on consumer hardware.

LTX-Video is Lightricks' open-source video generation model, built on a novel transformer architecture designed for real-time video synthesis. Understanding the research behind it helps explain why LTX 2.3 achieves its unique combination of speed and quality.

Core Architecture

Video Transformer (DiT-based)

LTX-Video uses a Diffusion Transformer (DiT) architecture adapted for video generation. Unlike UNet-based approaches, the transformer processes video as a sequence of spatiotemporal tokens, enabling:

  • Global attention: Every frame can attend to every other frame
  • Scalability: The 22B parameter model scales efficiently with compute
  • Temporal coherence: Motion consistency is maintained across the full sequence
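To make the token-sequence framing concrete, here is a small back-of-the-envelope sketch of how a clip becomes one flat sequence of spatiotemporal tokens, and why global attention cost grows quickly with clip size. The patch sizes and resolutions below are illustrative placeholders, not the actual LTX-Video configuration.

```python
# Hypothetical illustration: a video tensor is carved into
# (pt x ph x pw) spatiotemporal patches, each becoming one token.
# Patch sizes here are made up for the example.

def num_spatiotemporal_tokens(frames, height, width, pt=4, ph=8, pw=8):
    """Token count for a clip patchified along time, height, and width."""
    return (frames // pt) * (height // ph) * (width // pw)

def attention_pairs(tokens):
    """Global self-attention lets every token attend to every other,
    so the number of attention pairs grows quadratically."""
    return tokens * tokens

tokens = num_spatiotemporal_tokens(frames=48, height=512, width=768)
print(tokens)                   # → 73728 tokens for this example clip
print(attention_pairs(tokens))  # quadratic: why long/high-res clips are costly
```

The quadratic term is the practical reason base generation happens at modest resolution, with upscalers handling the rest (covered below).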

Latent Space Design

The model operates in a compressed latent space using a custom VAE (the taeltx2_3.safetensors file). Key properties:

  • High compression ratio: Reduces video to a compact latent representation
  • Temporal awareness: The VAE encodes motion information, not just per-frame appearance
  • Efficient decoding: Fast reconstruction from latents to pixel space
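The payoff of a high compression ratio is easy to quantify. The sketch below uses illustrative downsample factors and channel counts (not the real taeltx2_3 configuration) to show the bookkeeping: how many pixel values go in versus how many latent values come out.

```python
# Hypothetical latent-compression arithmetic. The temporal/spatial
# downsample factors and channel counts are placeholders, not the
# actual VAE's parameters.

def compression_ratio(frames, height, width,
                      st=8, ss=32, latent_channels=128, pixel_channels=3):
    """Ratio of raw pixel values to latent values for one clip."""
    pixels = frames * height * width * pixel_channels
    latents = (frames // st) * (height // ss) * (width // ss) * latent_channels
    return pixels / latents

print(compression_ratio(64, 512, 768))  # → 192.0 with these toy factors
```

Because the transformer only ever sees the latent grid, every factor of compression here directly shrinks the token sequence the attention layers must process.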

Text Conditioning

LTX-Video uses a large language model encoder for text conditioning, enabling:

  • Fine-grained prompt following
  • Understanding of complex scene descriptions
  • Consistent style and subject adherence across frames

Distillation: The Key to Speed

The distilled variant of LTX 2.3 is the result of consistency distillation — a technique that trains a student model to match the output of a multi-step teacher model in far fewer steps.

How It Works

  1. Teacher model: The full dev model, requiring 20-50 denoising steps
  2. Distillation training: The student learns to produce equivalent output in 8 steps
  3. CFG=1: The distilled model is trained without classifier-free guidance, eliminating the need for negative prompts and halving inference compute
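The steps above can be sketched with a toy 1-D "denoiser": a teacher that takes many small steps, and a student step size solved so that far fewer, larger steps land at the same endpoint. This is a minimal illustration of the step-reduction idea only, not the actual LTX distillation training code.

```python
# Toy sketch of step distillation on a 1-D denoiser. The teacher shrinks
# the sample toward 0 in many small steps; the student is given a larger
# step size chosen to match the teacher's endpoint in 8 steps.

def teacher_denoise(x, steps=40, lr=0.1):
    for _ in range(steps):
        x = x - lr * x          # one small denoising step
    return x

def student_denoise(x, rate, steps=8):
    for _ in range(steps):
        x = x - rate * x        # one large distilled step
    return x

def distill_rate(steps_teacher=40, steps_student=8, lr=0.1):
    # Match endpoints: (1 - r)^8 = (1 - lr)^40  =>  r = 1 - 0.9^(40/8)
    return 1 - (1 - lr) ** (steps_teacher / steps_student)

r = distill_rate()
x0 = 5.0
print(teacher_denoise(x0), student_denoise(x0, rate=r))  # nearly identical
```

In the real model the "step size" is a learned network, not a scalar, but the goal is the same: reproduce the teacher's trajectory endpoint with a fraction of the denoising steps.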

Quality vs Speed Trade-off

The distilled model achieves:

  • Far fewer steps: 8 instead of the dev model's 20-50
  • Near-identical quality for most prompts
  • No fine-tuning support — the distillation process makes the model less amenable to LoRA training

This is why LoRA must be applied to the dev model, not the distilled checkpoint.

FP8 Quantization

The official FP8 variants use 8-bit floating point quantization to reduce model size from ~42GB to ~29GB while maintaining quality:

  • Transformer-only quantization: Only the transformer weights are quantized; the VAE remains in full precision
  • Input scaling: Activations are scaled before quantization to minimize precision loss
  • Hardware requirement: RTX 40-series GPUs have native FP8 tensor cores; older GPUs emulate FP8 in software (slower)
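The input-scaling idea can be illustrated in plain Python: find the tensor's absolute maximum, scale values into the representable range, then round. This simulates the concept only; real FP8 (e4m3) uses a floating-point bit layout and runs on tensor cores, and the integer rounding here is just a stand-in for the coarse 8-bit grid.

```python
# Illustrative per-tensor "FP8-style" quantization: scale into range,
# round coarsely, keep the scale for dequantization. Not a real e4m3
# implementation; it only demonstrates the scaling idea.

FP8_MAX = 448.0   # largest finite value in the e4m3 format

def quantize_fp8(values):
    amax = max(abs(v) for v in values)
    scale = amax / FP8_MAX if amax > 0 else 1.0
    q = [round(v / scale) for v in values]   # coarse grid mimics 8-bit
    return q, scale

def dequantize_fp8(q, scale):
    return [v * scale for v in q]

w = [0.03, -1.7, 250.0, -0.004]
q, s = quantize_fp8(w)
w_hat = dequantize_fp8(q, s)
print(w_hat)   # large values survive; tiny ones collapse toward 0
```

Note how the smallest weights round to zero once a single large outlier sets the per-tensor scale: this is exactly the precision loss that careful input scaling (and keeping the VAE in full precision) is meant to contain.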

Spatial and Temporal Upscalers

The upscaler models are separate lightweight networks trained to:

  • Spatial upscaler: Increase spatial resolution (x1.5 or x2) while preserving temporal consistency
  • Temporal upscaler: Interpolate between frames to increase frame count (x2)

These enable a two-stage pipeline: generate at low resolution quickly, then upscale — a significant VRAM and time saving.
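The two-stage pipeline can be sketched as a simple orchestration. The functions below are stand-in stubs with hypothetical names, not a real LTX API; they only show how resolution and frame count compose across the stages.

```python
# Hypothetical two-stage pipeline orchestration with stub functions.

def generate_base(frames, height, width):
    # Stage 1: cheap base generation at low resolution (stub).
    return {"frames": frames, "h": height, "w": width}

def spatial_upscale(video, factor=2):
    # Stage 2a: raise resolution while preserving temporal consistency (stub).
    return {**video, "h": video["h"] * factor, "w": video["w"] * factor}

def temporal_upscale(video, factor=2):
    # Stage 2b: interpolate frames to double the frame count (stub).
    return {**video, "frames": video["frames"] * factor}

clip = temporal_upscale(spatial_upscale(generate_base(24, 256, 384)))
print(clip)   # → {'frames': 48, 'h': 512, 'w': 768}
```

The expensive transformer pass runs only on the small base clip; the lightweight upscalers then restore resolution and frame rate, which is where the VRAM and time savings come from.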

Implications for Users

Understanding the architecture explains several practical points:

  1. Why distilled ≠ dev: Different training objectives, not just fewer steps
  2. Why FP8 needs RTX 40xx: Hardware FP8 matmuls are a Hopper/Ada architecture feature
  3. Why LoRA needs dev model: Distillation changes the model's internal representations
  4. Why VAE is separate: The TAE (Temporal AutoEncoder) is a distinct component from the transformer

#ltx-2.3 #research #architecture #video-generation #transformer