Research | January 6, 2026

LTX-2: Efficient Joint Audio-Visual Foundation Model

Official research paper introducing LTX-2's asymmetric dual-stream transformer architecture with 14B video and 5B audio parameters. Explores modality-aware CFG and temporal synchronization mechanisms.

By Lightricks Research Team

Editor's Note: This paper introduces the LTX-2 architecture behind LTX 2.3. It covers the asymmetric dual-stream transformer, modality-aware CFG, the training strategy, and benchmark results for audio-visual generation.

The paper, available on arXiv, presents LTX-2, the foundation model that powers audio-visual generation in LTX 2.3.

Key Contributions

The paper's main contributions are:

  • Asymmetric dual-stream transformer architecture
  • Modality-aware classifier-free guidance (CFG)
  • Efficient training strategy for joint audio-visual generation
  • State-of-the-art benchmark results
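
To make the guidance idea concrete, here is a minimal sketch of classifier-free guidance applied with a separate scale per modality. It assumes the standard CFG formula (extrapolating from the unconditional prediction toward the conditional one) and simply gives the video and audio streams independent scales. The function names, parameter names, and default scales are hypothetical; the paper's exact formulation may differ.

```python
import numpy as np

def cfg(uncond, cond, scale):
    """Standard classifier-free guidance.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past it.
    """
    return uncond + scale * (cond - uncond)

def modality_cfg(video_uncond, video_cond, audio_uncond, audio_cond,
                 video_scale=3.0, audio_scale=7.0):
    """Apply CFG to each modality with its own guidance scale.

    ASSUMPTION: independent per-modality scales are an illustration only,
    not the formulation confirmed by the paper. The default scales are
    arbitrary placeholder values.
    """
    video = cfg(video_uncond, video_cond, video_scale)
    audio = cfg(audio_uncond, audio_cond, audio_scale)
    return video, audio

# Toy denoiser outputs for one sampling step (random stand-ins).
rng = np.random.default_rng(0)
video_uncond, video_cond = rng.normal(size=(2, 6, 16))
audio_uncond, audio_cond = rng.normal(size=(2, 10, 8))

video_pred, audio_pred = modality_cfg(
    video_uncond, video_cond, audio_uncond, audio_cond
)
```

Keeping the scales separate lets one modality follow the prompt more strongly than the other, which is one plausible reason to make guidance modality-aware.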

Architecture Overview

LTX-2 uses a dual-stream design: video and audio are processed by separate but interconnected transformer streams. The streams are asymmetric, with 14B parameters on the video side and 5B on the audio side, which is intended to keep joint generation efficient while maintaining high quality.
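
The sketch below illustrates what "separate but interconnected" streams could look like in a single block. It assumes the streams are linked by bidirectional cross-attention, which is a common choice but is not confirmed by this summary. All names, the single-head attention, and the toy widths (a wider video stream, a narrower audio stream) are hypothetical. Normalization, feed-forward layers, and timestep conditioning are omitted.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries_from, keys_from, wq, wk, wv):
    # Single-head scaled dot-product attention.
    q, k, v = queries_from @ wq, keys_from @ wk, keys_from @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def init_params(d_video, d_audio, d_head, seed=0):
    # Random projection matrices; each entry is a (wq, wk, wv) triple.
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))

    return {
        "video_self": (w(d_video, d_head), w(d_video, d_head), w(d_video, d_video)),
        "audio_self": (w(d_audio, d_head), w(d_audio, d_head), w(d_audio, d_audio)),
        # Cross-attention projects the other stream into this stream's width.
        "video_from_audio": (w(d_video, d_head), w(d_audio, d_head), w(d_audio, d_video)),
        "audio_from_video": (w(d_audio, d_head), w(d_video, d_head), w(d_video, d_audio)),
    }

def dual_stream_block(video, audio, params):
    # 1) Each stream attends to its own tokens (residual connection).
    video = video + attend(video, video, *params["video_self"])
    audio = audio + attend(audio, audio, *params["audio_self"])
    # 2) ASSUMPTION: the streams exchange information through
    #    bidirectional cross-attention, both reading the step-1 states.
    new_video = video + attend(video, audio, *params["video_from_audio"])
    new_audio = audio + attend(audio, video, *params["audio_from_video"])
    return new_video, new_audio

# Toy sizes: 6 video tokens of width 16, 10 audio tokens of width 8.
video = np.random.default_rng(1).normal(size=(6, 16))
audio = np.random.default_rng(2).normal(size=(10, 8))
params = init_params(d_video=16, d_audio=8, d_head=4)

video_out, audio_out = dual_stream_block(video, audio, params)
print(video_out.shape, audio_out.shape)  # (6, 16) (10, 8)
```

Each stream keeps its own token count and width, so the audio side can be much smaller than the video side while still conditioning on it at every block.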

#research #paper #architecture #dit #transformer #audio-visual