Max W. Y. Lam, Qiao Tian, Tang Li, Zongyu Yin, Siyuan Feng, Ming Tu, Yuliang Ji, Rui Xia, Mingbo Ma, Xuchen Song, Jitong Chen, Yuping Wang, Yuxuan Wang

Abstract
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs for semantic, coarse acoustic, and fine acoustic modeling, respectively. Yet, sampling with MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for real-time generation. Efficient music generation with quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality while reducing the forward passes in MusicLM by 95.7% or 99.6% when sampling 10s or 30s of music, respectively. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages of sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.
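The abstract describes DPD as injecting semantic information into segments of latents via cross-attention at each denoising step. As a rough, self-contained sketch of that conditioning pattern (not the paper's implementation; all weight matrices here are random placeholders standing in for learned projections, and the update rule is a toy one):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, semantic, seed=0):
    """Latent segments (queries) attend to semantic tokens (keys/values).

    latents:  (n_latent, d) noisy latent segments
    semantic: (n_sem, d)    semantic-token embeddings
    Returns a (n_latent, d) semantically-conditioned update.
    """
    rng = np.random.default_rng(seed)
    d = latents.shape[1]
    # Placeholder projections; in a trained model these are learned.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = latents @ Wq, semantic @ Wk, semantic @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n_latent, n_sem), rows sum to 1
    return (attn @ V) @ Wo

def denoise_step(latents, semantic, step_size=0.1):
    """One toy denoising step that folds semantic information into the latents."""
    return latents + step_size * cross_attention(latents, semantic)
```

This only illustrates how conditioning via cross-attention keeps every latent segment able to consult the full semantic-token sequence at each step; the actual DPD additionally models coarse and fine acoustics jointly.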
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| text-to-music-generation-on-musiccaps | MeLoDy | FAD: 5.41 |