Research · January 7, 2026

LTX-2: Efficient Joint Audio-Visual Foundation Model — Paper Breakdown (arxiv:2601.03233)

A breakdown of the LTX-2 paper — the asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream that enables native audio-video generation in LTX 2.3.

By ltx workflow

Editor's Note: Breakdown of the LTX-2 Efficient Joint Audio-Visual Foundation Model paper (arxiv:2601.03233) — the architecture behind LTX 2.3's native audio-video generation. Source: arxiv.org

Computer Science > Computer Vision and Pattern Recognition

Title: LTX-2: Efficient Joint Audio-Visual Foundation Model

Abstract: Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
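
To make the architecture description concrete, here is a minimal PyTorch-style sketch of one dual-stream block: per-stream self-attention, bidirectional audio-video cross-attention, and a single shared timestep embedding driving AdaLN in both streams. Every module name, dimension, and wiring choice below is an assumption for illustration only (temporal positional embeddings are omitted); this is not the released LTX-2 implementation.

```python
# Illustrative sketch of an LTX-2-style dual-stream block -- NOT the released code.
# Assumptions: the video stream is wider than the audio stream (asymmetric capacity),
# one shared diffusion-timestep embedding conditions both streams via AdaLN,
# and the streams exchange information through bidirectional cross-attention.
import torch
import torch.nn as nn


class AdaLN(nn.Module):
    """Maps the shared timestep embedding to per-stream scale/shift/gate."""
    def __init__(self, cond_dim: int, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(cond_dim, 3 * dim)

    def forward(self, x, t_emb):
        scale, shift, gate = self.proj(t_emb).chunk(3, dim=-1)
        x = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return x, gate.unsqueeze(1)


class DualStreamBlock(nn.Module):
    """One joint block: self-attention per stream plus bidirectional cross-attention."""
    def __init__(self, video_dim=2048, audio_dim=1024, cond_dim=512, heads=16):
        super().__init__()
        self.v_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(audio_dim, heads // 2, batch_first=True)
        # Project each stream to the other's width so it can serve as keys/values.
        self.a_to_v = nn.Linear(audio_dim, video_dim)
        self.v_to_a = nn.Linear(video_dim, audio_dim)
        self.v_cross = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.a_cross = nn.MultiheadAttention(audio_dim, heads // 2, batch_first=True)
        self.v_ada = AdaLN(cond_dim, video_dim)
        self.a_ada = AdaLN(cond_dim, audio_dim)

    def forward(self, v, a, t_emb):
        # "Cross-modality AdaLN": one timestep embedding drives both streams.
        v_n, v_gate = self.v_ada(v, t_emb)
        a_n, a_gate = self.a_ada(a, t_emb)
        v = v + v_gate * self.v_self(v_n, v_n, v_n, need_weights=False)[0]
        a = a + a_gate * self.a_self(a_n, a_n, a_n, need_weights=False)[0]
        # Bidirectional audio-video cross-attention (temporal PEs omitted here).
        a_kv, v_kv = self.a_to_v(a), self.v_to_a(v)
        v = v + self.v_cross(v, a_kv, a_kv, need_weights=False)[0]
        a = a + self.a_cross(a, v_kv, v_kv, need_weights=False)[0]
        return v, a
```

Stacking asymmetric blocks like this, with most of the width and depth allocated to the video side, is one way the 14B/5B capacity split described in the abstract could arise while the cross-attention layers keep the two streams temporally aligned.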

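The abstract also introduces modality-aware classifier-free guidance (modality-CFG) for audiovisual alignment and controllability. The paper's exact formulation isn't spelled out in the abstract; a natural reading is ordinary classifier-free guidance with an independent guidance scale per modality, and the sketch below assumes that reading. The function name, signature, and default scales are all hypothetical.

```python
# Hypothetical sketch of modality-aware classifier-free guidance (modality-CFG).
# Assumption: each stream gets its own guidance scale; the paper's exact
# formulation may differ.
def modality_cfg(model, v_t, a_t, t, text_emb, null_emb, w_video=7.0, w_audio=4.0):
    """Blend conditional and unconditional predictions with per-modality scales.

    `model(v_t, a_t, t, cond)` is assumed to return (video_pred, audio_pred).
    """
    v_cond, a_cond = model(v_t, a_t, t, text_emb)   # text-conditioned pass
    v_unc, a_unc = model(v_t, a_t, t, null_emb)     # unconditional pass
    v_pred = v_unc + w_video * (v_cond - v_unc)     # standard CFG per stream...
    a_pred = a_unc + w_audio * (a_cond - a_unc)     # ...with independent scales
    return v_pred, a_pred
```

Decoupling the two scales would let a user push prompt adherence harder on the video stream than on the audio stream (or vice versa), which matches the "alignment and controllability" framing in the abstract.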

Sources: arxiv.org/abs/2601.03233

#research #arxiv #ltx-2.3 #audio-video #transformer