Tutorials

LTX 2.3 Audio VAE: Setting Up Audio-to-Video Workflows in ComfyUI

Learn how to use LTX 2.3's native Audio VAE for synchronized audio-video generation in ComfyUI — required files, node setup, and generation tips for talking avatars and audio-driven scenes.

By ltx workflow

Editor's Note: LTX 2.3 introduced a native Audio VAE that enables synchronized audio-video generation — not post-processing dubbing, but joint generation in a single pass. This guide explains what the Audio VAE is, what files you need, and how to set it up in ComfyUI.

What Is the LTX 2.3 Audio VAE?

LTX 2.3 is architecturally different from most AI video models: it generates video and audio simultaneously in a single model pass. Audio is not added as a post-process step — the model jointly produces both streams, synchronized from the start.

The Audio VAE (LTX23_audio_vae_bf16.safetensors) is the component that encodes and decodes the audio latent space. It works alongside the Video VAE and the main transformer to produce outputs where speech, ambient sound, and environmental audio are synchronized with the video content.

This enables workflows like:

  • Image + audio → video: provide a portrait image and a speech audio file; the model generates a video with the subject's mouth moving in sync
  • Text-conditioned audio-video: describe a scene and generate both visuals and matching audio together
  • Talking avatar generation: natural lip-sync from a still image and speech input

Required Files

For audio-conditioned workflows, you need these files in addition to your standard checkpoint:

File | Location | Purpose
taeltx2_3.safetensors | models/vae/ | Standard video VAE — required for all workflows
LTX23_audio_vae_bf16.safetensors | models/vae/ | Audio VAE — required for audio-video workflows
LTX23_video_vae_bf16.safetensors | models/vae/ | Standalone video VAE in BF16 (alternative to taeltx2_3)
ltx-2.3_text_projection_bf16.safetensors | models/text_encoders/ | Text projection layer for workflows using separate encoders

Download all from: Kijai/LTX2.3_comfy on HuggingFace

The Audio VAE is ~365MB — a small download compared to the main checkpoint.
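
If you prefer scripting the downloads, the file-to-folder mapping from the table can be expressed as a small Python sketch. The repo name and filenames come from this article; the ComfyUI root path and the use of huggingface_hub are assumptions — adjust to your install.

```python
# Sketch: plan where each LTX 2.3 file goes inside a ComfyUI tree.
# Filenames and subfolders are taken from the table above; the
# "ComfyUI" root path is an assumption.
from pathlib import Path

# filename -> ComfyUI subfolder
FILES = {
    "taeltx2_3.safetensors": "models/vae",
    "LTX23_audio_vae_bf16.safetensors": "models/vae",
    "LTX23_video_vae_bf16.safetensors": "models/vae",
    "ltx-2.3_text_projection_bf16.safetensors": "models/text_encoders",
}

def plan_downloads(comfy_root: str) -> list[tuple[str, str]]:
    """Return (filename, destination dir) pairs without touching the network."""
    return [(name, str(Path(comfy_root) / sub)) for name, sub in FILES.items()]

if __name__ == "__main__":
    for name, dest in plan_downloads("ComfyUI"):
        # Actual fetch (needs `pip install huggingface_hub` and network access):
        # hf_hub_download(repo_id="Kijai/LTX2.3_comfy", filename=name, local_dir=dest)
        print(f"{name} -> {dest}")
```

The commented-out `hf_hub_download` call shows where a real fetch from the Kijai/LTX2.3_comfy repo would plug in; downloading manually from the HuggingFace web UI works just as well.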

Setting Up in ComfyUI

Step 1: Download and place files

# Audio VAE → models/vae/
LTX23_audio_vae_bf16.safetensors

# Video VAE (alternative) → models/vae/
LTX23_video_vae_bf16.safetensors

# Text projection → models/text_encoders/
ltx-2.3_text_projection_bf16.safetensors

Step 2: Update ComfyUI-LTXVideo nodes

Audio-video workflows require a recent version of the ComfyUI-LTXVideo custom nodes. Update via ComfyUI Manager or:

cd ComfyUI/custom_nodes/ComfyUI-LTXVideo
git pull
pip install -r requirements.txt

Step 3: Load an audio-video workflow

Audio-conditioned workflows are structurally different from standard text-to-video workflows. The key nodes you will see:

  • Audio Loader — loads your input audio file (.wav, .mp3)
  • LTX Audio VAE Encode — encodes audio into the latent space
  • LTX Image-to-Video Conditioning — combines image + audio conditioning
  • LTX Video Sampler — runs the main generation
  • LTX Audio VAE Decode — decodes audio from latent back to waveform
  • Video Combine — merges video frames + decoded audio into output file

Community workflows are available on Civitai (LTX 2.3 audio workflows) and in the ComfyUI-LTXVideo repository's examples folder.

Supported Checkpoint + Audio VAE Combinations

The Audio VAE works with all main LTX 2.3 checkpoints. Choose your checkpoint based on VRAM as usual:

VRAM | Recommended checkpoint
16GB | ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors
24GB | ltx-2.3-22b-distilled-1.1.safetensors + sequential offloading
32GB | ltx-2.3-22b-distilled-1.1.safetensors

The Audio VAE itself only uses ~1GB VRAM regardless of which main checkpoint you use.

Generation Tips

Input audio:

  • Speech works best — clear, clean mono or stereo recording
  • Recommended sample rate: 16kHz or 44.1kHz
  • Keep audio clips short (matching your target video length: 25–97 frames at your chosen FPS)
  • Use a vocal separator node (e.g. MelBand Roformer) if your audio has background music
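
The guidelines above can be sanity-checked with the Python standard library before you queue a generation. This is a minimal sketch assuming a .wav input; the file path and the exact rate/channel thresholds are taken from the recommendations above, not from the LTX code.

```python
# Sketch: check a .wav input against the input-audio guidelines above,
# using only the stdlib `wave` module.
import wave

GOOD_RATES = {16000, 44100}  # recommended sample rates from the article

def check_wav(path: str) -> dict:
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate
    return {
        "sample_rate_ok": rate in GOOD_RATES,
        "mono_or_stereo": channels in (1, 2),
        "duration_s": round(duration, 2),
    }

if __name__ == "__main__":
    # Create a 2-second silent 16 kHz mono clip so the example is self-contained.
    with wave.open("demo.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 16000 * 2)
    print(check_wav("demo.wav"))
```

For .mp3 inputs you would need a third-party decoder; the checks themselves stay the same.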

Resolution and frames:

  • Portrait (9:16) works well for talking avatars: 576×1024
  • Landscape for general scenes: 768×512 (3:2) or 1280×720 (16:9)
  • Frame count must be 8n+1: 25, 49, or 97 frames
  • Match video length to audio length — e.g. 49 frames at 24fps ≈ 2 seconds

Non-speech audio:

  • Audio without speech may come out at lower quality — the model is primarily optimized for speech-synchronized generation
  • For ambient sound or music, results vary; speech input gives the most consistent synchronization

What LTX 2.3 Audio Generation Is (and Isn't)

It is: Joint audio-video generation where both streams are produced together, giving natural timing and lip-sync from a single model.

It is not: A voice cloning or TTS system. You provide the audio — the model generates video synchronized to it. The audio output in the generated video is derived from your input audio through the VAE encode/decode cycle.

It is not a replacement for dedicated audio post-production. For professional voice work or precise audio design, dedicated tools will give more control. The LTX audio pipeline shines for rapid, synchronized talking avatar and audio-driven scene generation.


#ltx-2.3#comfyui#audio#audio-to-video#talking-avatar