ComfyUI Workflow Guide for LTX 2.3: Optimal Settings & Key Nodes
Master LTX 2.3 in ComfyUI with granular control over guidance, resolution up to 2560x1440, 50fps output, and faster inference. Learn the nodes that save VRAM and the exact mistakes that tank quality.
By ltx workflow
ComfyUI Workflow Guide for LTX 2.3
Editor's Note: This guide covers running LTX 2.3 in ComfyUI for superior video quality. Learn optimal settings, key nodes, and workflow configurations for text-to-video and image-to-video generation. Includes installation steps, parameter tuning, and common troubleshooting.
Five million downloads and counting. That's how many times developers have pulled LTX-2 and LTX-2.3 from HuggingFace since we released the open weights.
But here's what surprised us: most of them are still using the desktop app for serious work, when the ComfyUI nodes—the official ones we built—actually produce better results in most cases. The difference isn't small. We're talking sharper details, more consistent motion, faster inference, and way more control over every parameter that matters.
The reason is simple: ComfyUI gives you granular access to the inference pipeline. You can tune text guidance separately from cross-modal alignment. You can reuse prompt encodings across batches. You can push resolution to 2560x1440 and framerate to 50fps without fighting a GUI. Desktop simplifies things, but it hides the levers that make the difference.
This guide walks you through every step: installation, optimal settings for text-to-video and image-to-video, the nodes that save you VRAM and time, and the exact mistakes that tank quality.
By the end, you'll have a reproducible workflow that outperforms the desktop app.
Why ComfyUI Outperforms Desktop for LTX-2.3 Video Generation
The question keeps coming up in the community: why does the same model produce noticeably different results in ComfyUI versus Desktop?
The answer is pipeline architecture. Desktop uses a Python pipeline that bundles decisions—it encodes text, computes guidance, runs inference, and decodes the video in a fixed sequence. Solid. Fast. But fixed. You get good defaults, not optimal settings for your specific use case.
ComfyUI exposes the actual pipeline as discrete nodes. That means you control encoding, guidance, and inference separately. You can experiment. You can see what each parameter does. You can iterate without re-encoding text every time.
Here's what this unlocks:
Better quality at higher resolution. The community has confirmed that pushing 2560x1440 and 50fps dramatically improves detail and motion smoothness. Desktop's GUI isn't optimized for these settings. ComfyUI lets you dial in exactly what your hardware can handle.
Reusable conditioning. The LTXVSaveConditioning and LTXVLoadConditioning nodes let you encode a prompt once, then reuse that encoding across multiple inference runs. You save compute, reduce latency, and keep results consistent.
Granular guidance control. The Multimodal Guider node lets you tune prompt adherence and cross-modal consistency independently.
You can dial up motion fluidity without overfitting to the prompt — that's the difference between creepy, over-constrained motion and natural, believable movement.
Faster inference. The January 2026 update brought significant speed improvements to the ComfyUI nodes, primarily because ComfyUI skips GUI overhead and lets you reuse conditioning across runs. Desktop hasn't caught up.
The official ComfyUI-LTXVideo nodes use the latest VAE and the correct inference pipeline. That's why they beat Desktop. It's not magic. It's access.
Setting Up LTX-2.3 in ComfyUI
Prerequisites and Installation
You need three things running:
- ComfyUI (latest version). If you don't have it, clone from the official repo and install dependencies.
- ComfyUI-LTXVideo nodes (the official Lightricks package). These don't come with ComfyUI by default, so you'll install them separately.
- The LTX-2.3 model weights from HuggingFace. You'll specify the model ID when you set up your nodes.
Technically you can run this on CPU, but you'll wait 10 minutes per frame. Seriously. The full bf16 model requires 32GB+ VRAM — an A100 or RTX 6000 Ada gives you comfortable headroom. If you're on a 16GB or 24GB card, you'll need a quantized variant (GGUF or fp8) to fit the model in memory.
Installing ComfyUI-LTXVideo Nodes
The easiest route: use the ComfyUI Manager. Search for "LTXVideo," click Install, restart ComfyUI.
If that doesn't work, clone the repository directly into your ComfyUI custom_nodes folder:
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git
Restart ComfyUI. You'll see the LTX nodes under a new "Video" category in your node browser.
The model loads automatically the first time you use an LTX node: it downloads from HuggingFace and caches locally. The first load takes a few minutes depending on your connection. Subsequent loads are instant.
That's it. You're ready.
The Best ComfyUI Workflows for Text-to-Video
A text-to-video workflow is simple in structure but powerful in what it outputs: video from a prompt.
Here's the minimal setup:
- Gemma API Text Encoding node — Encodes your prompt into embeddings the model understands.
- LTXVTextToVideoSampler node — Runs the actual inference. Takes the embeddings, noise, and settings, outputs video frames.
- VAE Decode node — Converts latent space back to pixel space (the video you actually see).
- Video Combine node — Stitches frames into a video file.
That's the skeleton. Quality comes from the settings you feed into the sampler.
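The same skeleton can be expressed in ComfyUI's API (JSON) format and queued over the `/prompt` HTTP endpoint of a running instance. A minimal sketch follows; the `class_type` strings for the LTX and video nodes are illustrative, so copy the exact names from your installed node pack before using this.

```python
import json
import urllib.request

def build_t2v_workflow(prompt, width=1280, height=720, fps=24,
                       steps=50, cfg=3.0, seed=42):
    # API-format graph: node id -> {class_type, inputs}.
    # A value like ["1", 0] means "output slot 0 of node 1".
    # class_type strings are illustrative; check your node pack
    # for the exact names.
    return {
        "1": {"class_type": "GemmaAPITextEncoding",
              "inputs": {"text": prompt}},
        "2": {"class_type": "LTXVTextToVideoSampler",
              "inputs": {"conditioning": ["1", 0], "width": width,
                         "height": height, "fps": fps, "steps": steps,
                         "cfg": cfg, "seed": seed}},
        "3": {"class_type": "VAEDecode",
              "inputs": {"samples": ["2", 0]}},
        "4": {"class_type": "VideoCombine",
              "inputs": {"images": ["3", 0], "frame_rate": fps}},
    }

def queue_prompt(workflow, host="127.0.0.1:8188"):
    # POST the graph to ComfyUI's /prompt endpoint
    # (requires a running ComfyUI instance).
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"http://{host}/prompt", data=data,
        headers={"Content-Type": "application/json"})
    return json.loads(urllib.request.urlopen(req).read())
```

The graph dictionary is the same structure ComfyUI saves when you export a workflow in API format, so you can also build it once in the GUI and export it rather than writing it by hand.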
Optimal Settings for Quality
Resolution: Start at 1280x720. You get good results in 5-10 minutes on solid hardware. Once you're comfortable, push to 1920x1080 or 2560x1440. The detail boost is real. The time cost is linear.
Framerate: 24fps is the film standard. 30fps feels more fluid. 50fps catches subtle motion detail and looks noticeably smoother. The longer your video, the more 50fps matters. For 8-second clips, it's worth it.
Steps: Steps depend on which model you're running. The distilled model — recommended for most users — runs in 4–8 steps using a manual sigma schedule.
The full model uses 20–50 steps, with 50 as a sensible default that balances quality and speed.
At 40 steps it's noticeably faster with minimal quality loss; at 80 it pushes quality but roughly doubles render time. This guide is built around the full model — if you're on the distilled model, ignore the step counts above and stay in the 4–8 range.
CFG Scale (Guidance): This controls how strictly the model follows your prompt. 3.0-3.5 is the default. Stick to 2.0-5.0 for video.
Seed: Set it to a number (any number) if you want reproducibility. Set it to -1 for randomness. Video generation benefits from slightly different seeds within a series. If one frame looks off, reseed and rerun.
Using the Multimodal Guider
This is the secret weapon.
Under the hood it exposes separate controls for CFG guidance, Spatio-Temporal Guidance (STG), and modality scale — with independent settings for both video and audio streams (this guide focuses on video; the audio controls follow the same logic if you're working with sound).
Here's why the separation matters: sometimes your prompt pulls the video toward strict word fidelity while cross-modal sync pulls toward frame-to-frame semantic consistency.
If you push prompt adherence too hard without tuning sync, you get jittery, overfitted motion. If you lean too far into cross-modal, the video drifts from your prompt.
The Multimodal Guider lets you balance these independently rather than fighting a single CFG slider.
Prompt adherence controls how strictly the model follows your exact wording. Lower values give more creative interpretation; higher values lock closer to the prompt.
Cross-modal sync controls internal consistency and semantic coherence across frames. A modest increase smooths motion and reduces flicker. Push it too far and it tends to backfire.
If your video looks jittery or has temporal artifacts, try increasing cross-modal sync. If it isn't following your prompt, increase prompt adherence. Avoid maxing both at once.
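Conceptually, the two controls compose like this. The sketch below is a schematic of the general CFG-plus-STG recipe, not the Multimodal Guider's actual internals: the CFG term scales the gap between conditional and unconditional predictions (prompt adherence), while the STG term pushes away from a temporally perturbed prediction (consistency).

```python
import numpy as np

def combine_guidance(eps_uncond, eps_cond, eps_perturbed,
                     cfg_scale=3.0, stg_scale=1.0):
    """Schematic composition of two independent guidance terms.

    eps_uncond:    model prediction with an empty prompt
    eps_cond:      prediction conditioned on your prompt
    eps_perturbed: prediction with temporal attention perturbed
                   (the STG reference)
    """
    # Prompt adherence: scale the conditional/unconditional gap.
    out = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # Temporal consistency: push away from the perturbed prediction.
    out = out + stg_scale * (eps_cond - eps_perturbed)
    return out
```

With `cfg_scale=1.0` and `stg_scale=0.0` this collapses to the plain conditional prediction; raising either scale strengthens that term without touching the other, which is exactly the independence the node gives you over a single CFG slider.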
ComfyUI Image-to-Video Workflow with LTX-2.3
Image-to-video is text-to-video's practical cousin. You start with a still image, add a motion prompt, and the model generates video that extends from that starting frame.
The workflow is nearly identical to text-to-video, except you feed the image into an image loader and the LTXVImageToVideoSampler instead of the text sampler.
- Image Loader node — Points to your starting image (PNG, JPG, 16:9 ratio works best).
- Resize Image node — Scales to your target resolution (1920x1080, 2560x1440, etc.).
- Gemma API Text Encoding node — Encodes your motion prompt. Keep it short: "camera pans left, water ripples, gentle motion."
- LTXVImageToVideoSampler node — Generates video from the image and motion prompt.
- VAE Decode and Video Combine — Same as text-to-video.
4 Sampling Steps for Better Motion
Step 1: Motion Prompts, Not Scene Prompts
Your prompt should describe motion, not the scene. Bad: "a beautiful mountain landscape." Good: "camera slowly zooms in, clouds drift across the sky, light rays through trees."
Image-to-video assumes you already have the scene. You're adding movement. Write accordingly.
Step 2: Dial Down CFG Scale
Start at 3.0, not 7.0. The model already has the image as ground truth. It doesn't need as much guidance. High CFG can distort the image into something unrecognizable.
Step 3: Increase Steps for Smooth Motion
Image-to-video tends to benefit from a higher step count than text-to-video. If you're running the full model, try 60–80 instead of the standard 50.
The extra passes help smooth motion and reduce jitter at the image-to-video boundary. On the distilled model, stick with 8 steps — pushing beyond that won't improve motion and will just slow you down.
Step 4: Higher Frame Rate Helps
50fps is worth it for image-to-video. It makes the transition from the static image to video motion feel less jarring.
Resolution and Frame Rate Settings
Your image resolution should match your target video resolution. If your image is 1280x720 and you target 2560x1440, you'll get upsampling artifacts. Prep your image first.
The model works best at widescreen aspect ratios (16:9, 21:9). Portrait and square often produce distorted results.
Framerate: match whatever your downstream pipeline expects. YouTube likes 24fps. Social media and streaming apps prefer 30fps or 60fps. 50fps gives you the most flexibility in post: encode once at 50fps, downconvert later if needed.
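Downconverting a high-framerate master to a 24fps or 30fps delivery target is a one-line ffmpeg job. A small sketch that builds the invocation (file names are placeholders):

```python
import subprocess

def ffmpeg_downconvert_cmd(src, dst, fps=24):
    # Standard ffmpeg invocation: re-time video with the fps filter,
    # re-encode with x264 at a high-quality CRF, copy audio untouched.
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps}",
            "-c:v", "libx264", "-crf", "18", "-c:a", "copy", dst]

# To actually run it (requires ffmpeg on your PATH):
# subprocess.run(ffmpeg_downconvert_cmd("master.mp4", "out_24fps.mp4"),
#                check=True)
```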
Advanced Nodes That Save Time and VRAM
Once you understand the basics, these nodes unlock serious efficiency.
Gemma API Text Encoding
By default, ComfyUI-LTXVideo uses a local Gemma encoder. It works but ties up your GPU. The Gemma API Text Encoding node offloads encoding to Lightricks' free API instead.
Why use it:
- Encodes in under 1 second, regardless of GPU.
- Frees up your GPU for the expensive sampling step.
- Stays free indefinitely (no token limits, no soft limits).
- Perfect if you're iterating on settings and rerunning inference multiple times.
Drop the Gemma API Text Encoding node in place of the local encoder, pass your prompt, and it returns embeddings ready for the sampler. That's it.
Save and Load Conditioning
Here's a workflow superpower: encode your prompt once, save it, then reuse it across 50 different inference runs without re-encoding.
LTXVSaveConditioning node — Takes the encoded embeddings from your text encoder and saves them to disk as a .safetensors file.
LTXVLoadConditioning node — Loads that .safetensors file in future runs.
Why do this? Because encoding is expensive (a few seconds even on GPU). If you're testing 10 different sampler settings with the same prompt, you encode once, then load the conditioning 10 times. You save seconds per run. Across a day of iteration, you save hours.
It also makes your workflow reproducible. Same encoding, same random seed, same settings = identical output every time.
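The caching pattern itself is simple. Here is the idea sketched in plain numpy for illustration only; the actual nodes write .safetensors and handle all of this inside the graph:

```python
import os
import numpy as np

CACHE_DIR = "conditioning_cache"  # illustrative path

def save_conditioning(name, embeddings):
    # Cache encoded prompt embeddings so reruns can skip the encoder.
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.save(os.path.join(CACHE_DIR, f"{name}.npy"), embeddings)

def load_conditioning(name):
    # Reload the cached embeddings for reuse across sampler runs.
    return np.load(os.path.join(CACHE_DIR, f"{name}.npy"))
```

Encode once, save, then sweep sampler settings against the same loaded conditioning: identical embeddings plus a fixed seed is what makes runs byte-for-byte reproducible.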
Common Mistakes and How to Fix Them
Mistake 1: Ignoring Aspect Ratio
You feed a 1:1 square image into LTX and expect cinematic output. You get distorted, weird motion. LTX trains on 16:9 footage. Stick to that ratio. Prep your images accordingly.
Fix: Crop or pad images to 16:9 before inference.
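A centered 16:9 crop is a few lines of integer math. The sketch below computes a box you can pass straight to Pillow's `Image.crop`:

```python
def crop_box_16_9(width, height):
    # Centered crop box (left, top, right, bottom) matching 16:9.
    # Integer arithmetic avoids float rounding at exact ratios.
    if width * 9 > height * 16:
        # Too wide: trim the sides.
        new_w = height * 16 // 9
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Too tall (or already 16:9): trim top and bottom.
    new_h = width * 9 // 16
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)

# With Pillow: Image.open("in.png").crop(crop_box_16_9(w, h))
```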
Mistake 2: Text Guidance Too High
You set CFG scale to 15 because you want the model to "really follow your prompt." The video becomes stilted, over-constrained, and motion looks robotic.
Fix: Start at 3.0. Go lower (2.0) for creative freedom. Go higher (5.0) only if the model is completely ignoring your prompt.