Research · January 1, 2025

LTX-Video: Realtime Video Latent Diffusion — Paper Breakdown (arxiv:2501.00103)

Breakdown of the LTX-Video paper: a transformer-based latent diffusion model that integrates the Video-VAE and the denoising transformer for realtime video generation.

By ltx workflow

Editor's Note: Breakdown of the original LTX-Video paper (arxiv:2501.00103) — the transformer-based latent diffusion model that achieves realtime video generation. Source: arxiv.org

Computer Science > Computer Vision and Pattern Recognition

Title: LTX-Video: Realtime Video Latent Diffusion

Abstract: We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
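The abstract's headline numbers can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the LTX-Video code base. The 32 x 32 x 8 stride, the 768x512 resolution, the 24 fps rate and the 5 s / 2 s timings come from the abstract. The latent channel count of 128 is an assumption (the abstract does not state it), chosen here because it reproduces the quoted 1:192 ratio for RGB input. The frame count of 120 is simply 5 s times 24 fps and ignores any special handling of the first frame the real model may apply. All function and class names are hypothetical.

```python
from dataclasses import dataclass

# Figures stated in the abstract.
SPATIAL_STRIDE = 32    # pixels covered by one token along width and height
TEMPORAL_STRIDE = 8    # frames covered by one token
RGB_CHANNELS = 3       # input videos are assumed to be RGB

# Assumption: NOT stated in the abstract. 128 latent channels is used here
# because it reproduces the quoted 1:192 compression ratio.
LATENT_CHANNELS = 128


@dataclass(frozen=True)
class LatentGrid:
    """Number of latent tokens along each axis of a video."""
    tokens_w: int
    tokens_h: int
    tokens_t: int

    @property
    def total(self) -> int:
        return self.tokens_w * self.tokens_h * self.tokens_t


def compression_ratio(latent_channels: int = LATENT_CHANNELS) -> float:
    """Input values per token divided by latent values per token."""
    values_per_patch = (
        SPATIAL_STRIDE * SPATIAL_STRIDE * TEMPORAL_STRIDE * RGB_CHANNELS
    )
    return values_per_patch / latent_channels


def latent_grid(width: int, height: int, frames: int) -> LatentGrid:
    """Token grid for a video, assuming dimensions divide evenly by the strides."""
    return LatentGrid(
        tokens_w=width // SPATIAL_STRIDE,
        tokens_h=height // SPATIAL_STRIDE,
        tokens_t=frames // TEMPORAL_STRIDE,
    )


def realtime_factor(video_seconds: float, generation_seconds: float) -> float:
    """How many seconds of video are produced per second of compute."""
    return video_seconds / generation_seconds


if __name__ == "__main__":
    grid = latent_grid(width=768, height=512, frames=5 * 24)
    print("compression ratio 1:", compression_ratio())
    print("token grid (w, h, t):", grid.tokens_w, grid.tokens_h, grid.tokens_t)
    print("total tokens:", grid.total)
    print("attention pairs:", grid.total ** 2)
    print("realtime factor:", realtime_factor(5, 2))
```

Under these assumptions the 5-second clip maps to a 24 x 16 x 15 grid of tokens, which gives a rough sense of why full spatiotemporal self-attention stays affordable in this latent space: attention cost grows with the square of the token count, so a small grid matters far more than it would for a linear-cost operation.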

#research #arxiv #ltx-video #diffusion #vae