Tutorials

LTX 2.3 Audio VAE: Setting Up Audio-to-Video Workflows in ComfyUI

Learn how to use LTX 2.3's native Audio VAE for synchronized audio-video generation in ComfyUI — required files, node setup, and generation tips for talking avatars and audio-driven scenes.

By ltx workflow

Editor's Note: LTX 2.3 introduced a native Audio VAE that enables synchronized audio-video generation — not post-processing dubbing, but joint generation in a single pass. This guide explains what the Audio VAE is, what files you need, and how to set it up in ComfyUI.

What Is the LTX 2.3 Audio VAE?

LTX 2.3 is architecturally different from most AI video models: it generates video and audio simultaneously in a single model pass. Audio is not added as a post-process step — the model jointly produces both streams, synchronized from the start.

The Audio VAE (LTX23_audio_vae_bf16.safetensors) is the component that encodes and decodes the audio latent space. It works alongside the Video VAE and the main transformer to produce outputs where speech, ambient sound, and environmental audio are synchronized with the video content.

This enables workflows like:

  • Image + audio → video: provide a portrait image and a speech audio file; the model generates a video with the subject's mouth moving in sync
  • Text-conditioned audio-video: describe a scene and generate both visuals and matching audio together
  • Talking avatar generation: natural lip-sync from a still image and speech input

Required Files

For audio-conditioned workflows, you need these files in addition to your standard checkpoint:

File | Location | Purpose
taeltx2_3.safetensors | models/vae/ | Standard video VAE — required for all workflows
LTX23_audio_vae_bf16.safetensors | models/vae/ | Audio VAE — required for audio-video workflows
LTX23_video_vae_bf16.safetensors | models/vae/ | Standalone video VAE in BF16 (alternative to taeltx2_3)
ltx-2.3_text_projection_bf16.safetensors | models/text_encoders/ | Text projection layer for workflows using separate encoders

Download all from: Kijai/LTX2.3_comfy on HuggingFace

The Audio VAE is ~365MB — a small download compared to the main checkpoint.
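
If you prefer scripting the downloads, the file-to-folder mapping from the table can be expressed as a small Python sketch. The repo name and filenames come from this article; the ComfyUI root path and the use of huggingface_hub are assumptions — adjust to your install.

```python
# Sketch: plan where each LTX 2.3 file goes inside a ComfyUI tree.
# Filenames and subfolders are taken from the table above; the
# "ComfyUI" root path is an assumption.
from pathlib import Path

# filename -> ComfyUI subfolder
FILES = {
    "taeltx2_3.safetensors": "models/vae",
    "LTX23_audio_vae_bf16.safetensors": "models/vae",
    "LTX23_video_vae_bf16.safetensors": "models/vae",
    "ltx-2.3_text_projection_bf16.safetensors": "models/text_encoders",
}

def plan_downloads(comfy_root: str) -> list[tuple[str, str]]:
    """Return (filename, destination dir) pairs without touching the network."""
    return [(name, str(Path(comfy_root) / sub)) for name, sub in FILES.items()]

if __name__ == "__main__":
    for name, dest in plan_downloads("ComfyUI"):
        # Actual fetch (needs `pip install huggingface_hub` and network access):
        # hf_hub_download(repo_id="Kijai/LTX2.3_comfy", filename=name, local_dir=dest)
        print(f"{name} -> {dest}")
```

The commented-out `hf_hub_download` call shows where a real fetch from the Kijai/LTX2.3_comfy repo would plug in; downloading manually from the HuggingFace web UI works just as well.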

Setting Up in ComfyUI

Step 1: Download and place files

# Audio VAE → models/vae/
LTX23_audio_vae_bf16.safetensors

# Video VAE (alternative) → models/vae/
LTX23_video_vae_bf16.safetensors

# Text projection → models/text_encoders/
ltx-2.3_text_projection_bf16.safetensors

Step 2: Update ComfyUI-LTXVideo nodes

Audio-video workflows require a recent version of the ComfyUI-LTXVideo custom nodes. Update via ComfyUI Manager or:

cd ComfyUI/custom_nodes/ComfyUI-LTXVideo
git pull
pip install -r requirements.txt

Step 3: Load an audio-video workflow

Audio-conditioned workflows are structurally different from standard text-to-video workflows. The key nodes you will see:

  • Audio Loader — loads your input audio file (.wav, .mp3)
  • LTX Audio VAE Encode — encodes audio into the latent space
  • LTX Image-to-Video Conditioning — combines image + audio conditioning
  • LTX Video Sampler — runs the main generation
  • LTX Audio VAE Decode — decodes audio from latent back to waveform
  • Video Combine — merges video frames + decoded audio into output file

Community workflows are available on Civitai (LTX 2.3 audio workflows) and in the ComfyUI-LTXVideo repository's examples folder.

Supported Checkpoint + Audio VAE Combinations

The Audio VAE works with all main LTX 2.3 checkpoints. Choose your checkpoint based on VRAM as usual:

VRAM | Recommended checkpoint
16GB | ltx-2.3-22b-distilled-1.1_transformer_only_fp8_scaled.safetensors
24GB | ltx-2.3-22b-distilled-1.1.safetensors + sequential offloading
32GB | ltx-2.3-22b-distilled-1.1.safetensors

The Audio VAE itself only uses ~1GB VRAM regardless of which main checkpoint you use.

Generation Tips

Input audio:

  • Speech works best — clear, clean mono or stereo recording
  • Recommended sample rate: 16kHz or 44.1kHz
  • Keep audio clips short (matching your target video length: 25–97 frames at your chosen FPS)
  • Use a vocal separator node (e.g. MelBand Roformer) if your audio has background music
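
The guidelines above can be sanity-checked with the Python standard library before you queue a generation. This is a minimal sketch assuming a .wav input; the file path and the exact rate/channel thresholds are taken from the recommendations above, not from the LTX code.

```python
# Sketch: check a .wav input against the input-audio guidelines above,
# using only the stdlib `wave` module.
import wave

GOOD_RATES = {16000, 44100}  # recommended sample rates from the article

def check_wav(path: str) -> dict:
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        duration = w.getnframes() / rate
    return {
        "sample_rate_ok": rate in GOOD_RATES,
        "mono_or_stereo": channels in (1, 2),
        "duration_s": round(duration, 2),
    }

if __name__ == "__main__":
    # Create a 2-second silent 16 kHz mono clip so the example is self-contained.
    with wave.open("demo.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(16000)
        w.writeframes(b"\x00\x00" * 16000 * 2)
    print(check_wav("demo.wav"))
```

For .mp3 inputs you would need a third-party decoder; the checks themselves stay the same.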

Resolution and frames:

  • Portrait (9:16) works well for talking avatars: 576×1024
  • Landscape for general scenes: 768×512 (3:2) or 1280×720 (16:9)
  • Frame count must be 8n+1: 25, 49, or 97 frames
  • Match video length to audio length — e.g. 49 frames at 24fps ≈ 2 seconds

Non-speech audio:

  • Audio without speech may come out at lower quality — the model is primarily optimized for speech-synchronized generation
  • For ambient sound or music, results vary; speech input gives the most consistent synchronization

What LTX 2.3 Audio Generation Is (and Isn't)

It is: Joint audio-video generation where both streams are produced together, giving natural timing and lip-sync from a single model.

It is not: A voice cloning or TTS system. You provide the audio — the model generates video synchronized to it. The audio output in the generated video is derived from your input audio through the VAE encode/decode cycle.

It is not a replacement for dedicated audio post-production. For professional voice work or precise audio design, dedicated tools will give more control. The LTX audio pipeline shines for rapid, synchronized talking avatar and audio-driven scene generation.


#ltx-2.3#comfyui#audio#audio-to-video#talking-avatar