LTX-2: Efficient Joint Audio-Visual Foundation Model — Paper Breakdown (arxiv:2601.03233)
Breakdown of the LTX-2 paper — the asymmetric dual-stream transformer architecture with a 14B-parameter video stream and a 5B-parameter audio stream that enables native audio-video generation in LTX 2.3.
By ltx workflow
Editor's Note: Breakdown of the LTX-2 Efficient Joint Audio-Visual Foundation Model paper (arxiv:2601.03233) — the architecture behind LTX 2.3's native audio-video generation. Source: arxiv.org
Computer Science > Computer Vision and Pattern Recognition
Title: LTX-2: Efficient Joint Audio-Visual Foundation Model
Abstract: Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.
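The coupling the abstract describes — two streams exchanging information through bidirectional cross-attention, with AdaLN modulation driven by a shared diffusion timestep embedding — can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the token counts, dimensions, identity Q/K/V projections, and random AdaLN weights are all placeholder assumptions, and temporal positional embeddings are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d):
    # Single-head scaled dot-product attention; Q/K/V projections are
    # left as identity for brevity (a real block would learn them).
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    return softmax(scores) @ kv_tokens

def adaln(x, t_emb, w_scale, w_shift):
    # AdaLN: layer norm whose scale/shift are predicted from the timestep
    # embedding. "Cross-modality AdaLN" shares t_emb across both streams.
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    scale, shift = t_emb @ w_scale, t_emb @ w_shift
    return (x - mu) / (sigma + 1e-6) * (1 + scale) + shift

rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))   # 8 video tokens (hypothetical count)
audio = rng.normal(size=(4, d))   # 4 audio tokens (hypothetical count)
t_emb = rng.normal(size=(d,))     # diffusion timestep embedding, shared

w_scale = rng.normal(size=(d, d)) * 0.01  # placeholder AdaLN weights
w_shift = rng.normal(size=(d, d)) * 0.01

# Bidirectional audio-video cross-attention: each stream reads the other,
# with a residual connection so its own tokens are preserved.
video = video + cross_attention(adaln(video, t_emb, w_scale, w_shift), audio, d)
audio = audio + cross_attention(adaln(audio, t_emb, w_scale, w_shift), video, d)

print(video.shape, audio.shape)  # token counts unchanged: (8, 16) (4, 16)
```

The asymmetry in the paper (14B vs. 5B parameters) lives in the per-stream self-attention and MLP blocks, not shown here; the cross-attention bridge is what keeps the two streams temporally synchronized while letting each keep its own capacity.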