LTX-2: Efficient Joint Audio-Visual Foundation Model
Official research paper introducing LTX-2's asymmetric dual-stream transformer architecture with 14B video and 5B audio parameters. Explores modality-aware CFG and temporal synchronization mechanisms.
By Lightricks Research Team
Editor's Note: This paper introduces the LTX-2 architecture behind LTX 2.3. Learn about the asymmetric dual-stream transformer, modality-CFG techniques, training strategy, and benchmark results for audio-visual generation.
The paper, published on arXiv, presents LTX-2, the foundation model architecture behind the audio-visual generation capabilities of LTX 2.3.
Key Contributions
The paper's main contributions are:
- Asymmetric dual-stream transformer architecture
- Modality-aware classifier-free guidance (modality-CFG)
- Efficient training strategy for joint audio-visual generation
- State-of-the-art benchmark results
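To make the guidance item above concrete, here is a minimal sketch of classifier-free guidance with separate scales per conditioning signal. It is an illustration under stated assumptions, not the paper's formulation: this summary does not give the exact equation, so the function name `modality_cfg`, the three-prediction setup (full conditioning, text dropped, cross-modal context dropped), and the linear combination are all hypothetical.

```python
import numpy as np

def modality_cfg(pred_full, pred_no_text, pred_no_cross, text_scale, cross_scale):
    """Combine three denoiser predictions with independent guidance scales.

    ASSUMPTION: this linear form is one common way to extend CFG to two
    conditioning signals; it is not taken from the LTX-2 paper.
    A scale of 1.0 means "no extra guidance" for that signal.
    """
    text_direction = pred_full - pred_no_text     # effect of the text prompt
    cross_direction = pred_full - pred_no_cross   # effect of the other modality
    return (pred_full
            + (text_scale - 1.0) * text_direction
            + (cross_scale - 1.0) * cross_direction)

# Toy predictions standing in for one stream's denoiser outputs.
pred_full = np.array([1.0, 2.0])
pred_no_text = np.array([0.5, 1.0])
pred_no_cross = np.array([1.0, 1.5])

guided = modality_cfg(pred_full, pred_no_text, pred_no_cross,
                      text_scale=3.0, cross_scale=2.0)
print(guided)  # [2.  4.5]
```

In a two-stream model, the same function could be called once per stream with different scale values, which is the sense in which the guidance becomes modality-dependent in this sketch.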
Architecture Overview
LTX-2 uses an asymmetric dual-stream design: the visual and audio modalities are processed by separate but interconnected transformer streams of different sizes (14B parameters for video, 5B for audio), which enables efficient joint generation while maintaining high quality.
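The idea of two interconnected streams of different widths can be sketched with bidirectional cross-attention, where each stream queries the other and adds the result to its own tokens. This is a toy illustration, not the LTX-2 implementation: the layer widths, token counts, weight initialization, and the names `cross_attention` and `dual_stream_block` are assumptions, and self-attention, feed-forward layers, and timestep conditioning are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v, w_o):
    """Tokens in `queries` attend to tokens in `context` (single head)."""
    q = queries @ w_q                                   # (n_q, d_attn)
    k = context @ w_k                                   # (n_c, d_attn)
    v = context @ w_v                                   # (n_c, d_attn)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_q, n_c)
    return (weights @ v) @ w_o                          # back to the query stream's width

def make_weights(d_query, d_context, d_attn):
    """Random projections that map both streams into a shared attention width."""
    return (rng.normal(0, 0.02, (d_query, d_attn)),
            rng.normal(0, 0.02, (d_context, d_attn)),
            rng.normal(0, 0.02, (d_context, d_attn)),
            rng.normal(0, 0.02, (d_attn, d_query)))

# ASSUMPTION: illustrative widths only; the video stream is wider than the audio stream.
D_VIDEO, D_AUDIO, D_ATTN = 64, 24, 16
video_from_audio = make_weights(D_VIDEO, D_AUDIO, D_ATTN)
audio_from_video = make_weights(D_AUDIO, D_VIDEO, D_ATTN)

def dual_stream_block(video_tokens, audio_tokens):
    """Each stream keeps its own width and receives a residual update from the other."""
    new_video = video_tokens + cross_attention(video_tokens, audio_tokens, *video_from_audio)
    new_audio = audio_tokens + cross_attention(audio_tokens, video_tokens, *audio_from_video)
    return new_video, new_audio

video = rng.normal(size=(12, D_VIDEO))   # 12 video tokens
audio = rng.normal(size=(20, D_AUDIO))   # 20 audio tokens
video_out, audio_out = dual_stream_block(video, audio)
print(video_out.shape, audio_out.shape)  # (12, 64) (20, 24)
```

Projecting both streams into a shared attention width is what lets streams of unequal size exchange information without forcing them to the same hidden dimension.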