HyperAIHyperAI

Command Palette

Search for a command to run...

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su Yuming Li Zeyue Xue Jie Huang Siming Fu Haoran Li Ying Li Zezhong Qian Haoyang Huang Nan Duan

Abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at simsimsim25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. extbf{Project Page:} href{https://omniforcing.com}{https://omniforcing.com}

One-sentence Summary

Researchers from JD Explore Academy, Fudan University, Peking University, and the University of Hong Kong propose OmniForcing, a framework that distills bidirectional audio-visual diffusion models into real-time streaming generators. By introducing asymmetric block-causal alignment and audio sink tokens, it overcomes training instability to achieve 25 FPS generation while preserving high-fidelity synchronization.

Key Contributions

  • OmniForcing addresses the high latency of bidirectional joint audio-visual diffusion models by distilling them into a high-fidelity streaming autoregressive generator that enables real-time applications.
  • The framework introduces Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix and an Audio Sink Token mechanism to resolve training instability caused by extreme temporal asymmetry and token sparsity.
  • By employing a Joint Self-Forcing Distillation paradigm and a Modality-Independent Rolling KV-Cache, the method achieves state-of-the-art streaming generation at approximately 25 FPS on a single GPU while maintaining synchronization and visual quality comparable to the teacher model.

Introduction

Recent joint audio-visual diffusion models like LTX-2 deliver high-fidelity synchronized content but rely on bidirectional attention that requires processing the entire timeline at once. This architecture creates prohibitive latency and prevents real-time streaming, while existing workarounds either decouple modalities to degrade quality or fail to stabilize when applied to dual-stream systems due to extreme token sparsity and temporal asymmetry. The authors introduce OmniForcing, the first framework to distill an offline bidirectional model into a high-fidelity streaming autoregressive generator. They resolve training instability through an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix and an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Additionally, a Joint Self-Forcing Distillation paradigm allows the model to self-correct cumulative errors, enabling state-of-the-art streaming generation at approximately 25 FPS on a single GPU.

Method

The authors leverage a dual-stream transformer backbone to enable real-time, streaming joint generation of temporally aligned video and audio. The overall framework, depicted in the OmniForcing pipeline, restructures a pretrained bidirectional model into a block-causal autoregressive system. This process involves a three-stage distillation pipeline designed to transfer the teacher's high-fidelity joint distribution to an ultra-fast causal engine.

The training process follows a sequential distillation paradigm to smoothly decouple few-step denoising from the causal generation paradigm. Stage I employs Bidirectional Distribution Matching Distillation (DMD) to adapt the model for few-step denoising while preserving the global receptive field. Stage II utilizes causal ODE regression to adapt the network weights to the asymmetric block-causal mask, correcting the conditional distribution shift. Finally, Stage III implements joint Self-Forcing training by autoregressively unrolling the generation process to mitigate exposure bias and ensure cross-modal synchrony.

To address the extreme frequency asymmetry between video (3 FPS) and audio (25 FPS) latents, the method employs an Asymmetric Block-Causal Masking design. This approach bridges the information density gap by establishing a physical-time-based Macro-block Alignment. As shown in the figure below, the timeline is partitioned into 1-second macro-blocks, where each block encapsulates a fixed number of video and audio latents without fractional remainders.

The initial components are merged into a Global Prefix block (B0\mathcal{B}_0B0) which functions as a system prompt, remaining globally visible to all future tokens. To prevent gradient explosions and Softmax collapse caused by the sparse history in early audio blocks, the authors introduce learnable Sink Tokens prepended to the audio sequence. These tokens are anchored within the global prefix and utilize an Identity RoPE constraint to remain position-agnostic. During inference, the architecture supports asymmetric compute allocation and parallel inference through modality-independent rolling KV caches, enabling real-time synchronized generation.

Experiment

  • OmniForcing is evaluated against bidirectional and cascaded autoregressive baselines to validate its ability to achieve high-fidelity streaming audio-visual generation with real-time efficiency.
  • The method demonstrates a significant speedup over offline teacher models, enabling true streaming playback with low latency while maintaining visual and audio quality comparable to the strongest joint models.
  • Qualitative analysis confirms the model successfully generates layered sounds, synchronized speech, and complex audio blends that align precisely with visual events.
  • Ablation studies validate that Audio Sink Tokens combined with Identity RoPE are essential for stabilizing training under causal constraints, whereas alternative stabilization methods lead to convergence failures or degraded output quality.
  • Overall, the experiments confirm that OmniForcing achieves a massive reduction in inference time while preserving the perceptual fidelity and cross-modal coherence of the original bidirectional teacher.

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing

HyperAI Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp