Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models can be used for music generation
  • Music generation requires handling multiple aspects
  • Developed a cascading latent diffusion approach to generate high-quality stereo music
  • Targeting real-time on a single consumer GPU
  • Open-sourced code and music samples for all models

Paper Content

Introduction

  • Music generation is a challenging problem
  • Recently, deep learning models have been used to explore audio generation
  • Existing models explore the use of recurrent neural networks, generative adversarial networks, autoencoders, and transformers
  • Diffusion models have been used in speech synthesis, but are still under-explored for music generation
  • Long-term structure, sound quality, diversity of music, and control of generation are challenges in the area of music generation
  • Moûsai is a text-conditional cascading diffusion model that tries to address all the challenges
  • Moûsai uses a custom two-stage cascading diffusion method
  • Moûsai can generate long-context 48kHz stereo music exceeding the minute mark
  • Moûsai uses an efficient 1D U-Net architecture for both stages of the cascade
  • Moûsai uses a diffusion magnitude autoencoder to compress the audio signal 64x
  • A common trend in the generative space is to first train a compression model on the input domain, then learn a generative model on top of the reduced representation
  • Auto-encoding and quantized auto-encoding are popular compression methods for images
  • Two popular directions in generative space are to learn a quantized representation or use a compressed/downsampled representation
  • Cascading diffusion approach has not been attempted for audio generation
  • Our work follows ideas from cascading diffusion approach, using a two-stage method to compress audio and generate reduced representation while conditioning on a textual description

Preliminaries

  • Diffusion: a generative model that learns to reverse a gradual noising process, turning noise into data
  • Latent diffusion: diffusion applied to a compressed latent representation instead of the raw signal
  • U-Net: a convolutional encoder-decoder network with skip connections

Audio generation

  • Audio generation is a challenging task
  • Waveforms can be represented in different resolutions
  • Higher sample rates allow for more temporal resolution
  • Qualitative properties such as texture and pitch can be observed
  • Audio can be represented with mono, stereo, or surround sound
  • Models can be trained on single or multiple modalities

Diffusion

  • Employed v-objective diffusion as proposed by Salimans & Ho (2022).
  • Used DDIM sampler (Song et al., 2021) to turn noise into a new datapoint.
  • DDIM sampler denoises signal by repeated application of an equation.
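
The repeated denoising update can be sketched as follows. This is a minimal NumPy illustration, assuming the common v-objective parameterisation with a cosine/sine noise schedule (α = cos(σπ/2), β = sin(σπ/2)). The function names, the linear step schedule, and the `v_model` callable are hypothetical stand-ins for a trained network, not the authors' code.

```python
import numpy as np

def alpha_beta(sigma):
    """Signal and noise weights for noise level sigma in [0, 1] (assumed schedule)."""
    phi = sigma * np.pi / 2
    return np.cos(phi), np.sin(phi)

def v_target(x0, eps, sigma):
    """Training target of the v-objective: v = alpha * eps - beta * x0."""
    a, b = alpha_beta(sigma)
    return a * eps - b * x0

def ddim_sample(v_model, shape, num_steps, rng):
    """Deterministic DDIM sampling: start from noise and denoise step by step."""
    sigmas = np.linspace(1.0, 0.0, num_steps + 1)  # hypothetical linear schedule
    x = rng.standard_normal(shape)                 # pure noise at sigma = 1
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        a, b = alpha_beta(s_cur)
        v = v_model(x, s_cur)          # network prediction of v
        x0_hat = a * x - b * v         # implied clean signal
        eps_hat = b * x + a * v        # implied noise
        a_n, b_n = alpha_beta(s_next)
        x = a_n * x0_hat + b_n * eps_hat  # re-noise to the next, lower level
    return x
```

With a perfect `v_model` the loop recovers the clean signal exactly; with a trained network, more steps generally give a closer approximation at the cost of more network evaluations.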

Latent diffusion

  • Audio is compressed into a smaller representation
  • Diffusion process is applied to the reduced latent space
  • A diffusion-based autoencoder is proposed instead of a standard autoencoder
  • This increases the representation power of the decoding process and the compressibility

U-Net

  • U-Nets were first proposed by Ronneberger et al. (2015)
  • Used for medical image segmentation, and since repurposed for multiple uses
  • Our proposed U-Net has little resemblance to the original work
  • Includes more modern convolutional blocks, a variety of attention blocks, conditioning blocks, and improved skip connections
  • Moûsai is composed of two independently trained models
  • First stage (DMAE) compresses the audio waveform 64x using a diffusion autoencoder
  • Second stage (latent text-to-audio diffusion) generates a novel latent while conditioning on text embeddings
  • Both diffusion models use same efficient 1D U-Net architecture with varying configurations

1D U-Net

  • Used 1D U-Net architecture for autoencoding and latent diffusion
  • 1D convolutional kernels are more efficient than 2D
  • Used variety of items at each resolution of U-Net: residual, modulation, inject, attention, cross attention
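
One way to picture "items at each resolution" is a per-depth configuration. The sketch below is purely illustrative: the class and field names are hypothetical, and the values are the ones listed under Implementation details in this summary. Setting cross attention to zero for the autoencoder is an assumption, since that model is not text-conditioned.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UNet1DConfig:
    """Hypothetical per-depth layout of a 1D U-Net; names are illustrative only."""
    depth: int                  # number of nested U-Net blocks
    attention: List[int]        # 1 = self-attention enabled at that depth
    cross_attention: List[int]  # 1 = cross attention enabled at that depth
    inject_depth: int = -1      # depth of channel injection, -1 = none

    def __post_init__(self):
        # Every per-depth list needs exactly one flag per nested block.
        if len(self.attention) != self.depth or len(self.cross_attention) != self.depth:
            raise ValueError("need one flag per depth")

# Stage 1: diffusion autoencoder, no attention, injection at depth 4.
autoencoder_cfg = UNet1DConfig(depth=7, attention=[0] * 7,
                               cross_attention=[0] * 7, inject_depth=4)

# Stage 2: text-conditional generator, attention skipped in the first two blocks.
generator_cfg = UNet1DConfig(depth=6, attention=[0, 0, 1, 1, 1, 1],
                             cross_attention=[1] * 6)
```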

Diffusion magnitude-autoencoding (DMAE)

  • Diffusion autoencoders were introduced by Preechakul et al. (2022) as a way to condition the diffusion process on a compressed latent vector of the input.
  • Magnitude spectrograms are encoded into a latent vector using a 1D convolutional encoder.
  • The original waveform is reconstructed by decoding the latent using a diffusion model.
  • Phase is discarded to obtain higher compression ratios.
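
To make "magnitude spectrogram with phase discarded" concrete, here is a minimal NumPy sketch of the encoder's input representation. The window, FFT size, and hop length are hypothetical values chosen for illustration; this summary does not state the ones used in the paper, and the learned encoder and diffusion decoder are not shown.

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=1024, hop=256):
    """Short-time Fourier transform magnitude of a mono waveform.

    n_fft and hop are illustrative assumptions, not the paper's settings.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=-1)  # complex values: magnitude and phase
    return np.abs(spectrum)                  # keep magnitude, discard phase
```

Because the phase is thrown away here, the waveform cannot be recovered by a simple inverse transform, which is why the decoding side is left to a generative (diffusion) model.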

Latent text-to-audio diffusion

  • Latent diffusion is applied to the compressed space
  • V-objective diffusion is used with a 1D U-Net architecture
  • Text embedding is used to generate compressed latent
  • Cross attention blocks provide conditioning text embedding
  • Multiple attention blocks allow information to be shared over the entire latent
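
The conditioning path can be illustrated with a single-head cross-attention sketch in NumPy: queries come from the latent sequence, keys and values from the text embedding. The projection matrices and shapes are hypothetical, and the real blocks are likely multi-head with additional normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text_emb, w_q, w_k, w_v):
    """Single-head cross attention.

    latent:   (L, d_latent) sequence being denoised
    text_emb: (T, d_text)   text embedding tokens
    """
    q = latent @ w_q                            # (L, d) queries from the latent
    k = text_emb @ w_k                          # (T, d) keys from the text
    v = text_emb @ w_v                          # (T, d) values from the text
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (L, T) scaled similarities
    weights = softmax(scores, axis=-1)          # each latent step attends over tokens
    return weights @ v, weights
```

Self-attention follows the same pattern with keys and values taken from the latent itself, which is how information is shared across the whole latent.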

Text conditioning

  • Use pre-trained language model to generate text embeddings
  • Use classifier-free guidance with a learned mask applied on batch elements with a probability of 0.1
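
A rough sketch of both halves of classifier-free guidance, assuming the usual formulation: during training, the text embedding of a batch element is swapped for a mask embedding with probability 0.1; at inference, conditional and unconditional predictions are combined with a guidance scale. Function names and array shapes are hypothetical, and the mask would be a learned parameter in practice.

```python
import numpy as np

def drop_conditioning(text_emb, mask_emb, rng, p=0.1):
    """Replace whole batch elements of text_emb with mask_emb with probability p.

    text_emb: (B, T, D) batch of text embeddings
    mask_emb: (T, D)    stand-in for the learned mask embedding
    """
    drop = rng.random(text_emb.shape[0]) < p  # one decision per batch element
    mixed = np.where(drop[:, None, None], mask_emb[None], text_emb)
    return mixed, drop

def guided_prediction(pred_cond, pred_uncond, scale):
    """Push the unconditional prediction toward the conditional one."""
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

A scale of 1.0 reproduces the conditional prediction unchanged; larger scales strengthen the influence of the text.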

Experimental setup

  • Dataset and training setup overview in Section 5.1
  • Implementation details in Section 5.2
  • Hardware requirements in Section 5.3

Dataset and training setup

  • Compiled a collection of 2,500 hours of stereo music
  • Autoencoder trained on random crops of length 2^18 samples (about 5.5 s at 48 kHz)
  • Text-conditional diffusion generation model trained on fixed crops of length 2^21 samples (about 44 s at 48 kHz)
  • Metadata used for textual description includes title, author, album, genre, and year of release
  • Metadata list shuffled and elements dropped with probability of 0.1
  • Metadata list concatenated with spaces or commas for robustness during inference
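
The metadata-to-prompt procedure described above can be sketched in a few lines of standard-library Python. The function name and the example record are hypothetical; only the shuffle, the 0.1 drop probability, and the space-or-comma join come from the text.

```python
import random

def build_prompt(metadata, rng, drop_prob=0.1):
    """Turn a metadata record into a randomised text description."""
    items = [str(value) for value in metadata.values()]
    rng.shuffle(items)                                        # random field order
    kept = [it for it in items if rng.random() >= drop_prob]  # drop each w.p. drop_prob
    separator = rng.choice([" ", ", "])                       # spaces or commas
    return separator.join(kept)

# Example record (made up for illustration).
example = {"title": "Song", "author": "Artist", "album": "Record",
           "genre": "Jazz", "year": 1999}
prompt = build_prompt(example, random.Random(0))
```

Randomising order, content, and separator means the model sees many surface forms of the same description, which is the robustness the text refers to.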

Implementation details

  • Trained a 185M-parameter diffusion autoencoder with 7 nested U-Net blocks
  • No attention used to allow decoding of variable and possibly very long latents
  • Channel injection only happens at depth 4
  • Trained a 857M text-conditional generator with 6 nested U-Net blocks
  • Attention blocks enabled per depth as [0, 0, 1, 1, 1, 1], i.e. skipped in the first two blocks
  • Cross attention blocks used at all resolutions
  • AdamW optimizer used with learning rate of 10^-4, β1 = 0.95, β2 = 0.999, ε = 10^-6, and weight decay of 10^-3
  • Exponential moving average (EMA) used with β = 0.995 and power of 0.7
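
The summary does not say how the EMA "power" enters the update. One plausible reading, used by some EMA implementations, is a warm-up schedule in which the decay grows as 1 - (1 + step)^(-power) and is capped at β. The sketch below follows that assumption; the function names and the `inv_gamma` parameter are hypothetical.

```python
def ema_decay(step, beta=0.995, power=0.7, inv_gamma=1.0):
    """Assumed warm-up schedule: decay rises from 0 and is capped at beta."""
    value = 1.0 - (1.0 + step / inv_gamma) ** -power
    return min(max(value, 0.0), beta)

def ema_update(ema_value, new_value, decay):
    """Standard exponential moving average update for one parameter."""
    return decay * ema_value + (1.0 - decay) * new_value
```

Under this reading the averaged weights track the raw weights closely early in training and become increasingly smooth as the decay approaches 0.995.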

Hardware requirements

  • Training of both models can be done on a single A100 GPU in 1 week using a batch size of 32.
  • Inference of a novel audio source of ~88 s takes less than ~88 s on a consumer GPU, i.e. generation runs in real time.

Results

  • Model generates long-context music from text descriptions
  • Most other models do not take text as input
  • Riffusion model is the only comparable model
  • Evaluated from multiple perspectives: genre diversity, relevance, sound quality, long-term structure
  • No perfect evaluation metric for music
  • Listen to samples for holistic impression

Diversity & text-to-music relevance

  • Conducted a listener test to illustrate diversity and text relevance of Moûsai
  • Composed a list of 40 text prompts spanning across 4 music genres
  • Generated 80 pieces of music, 2 for each prompt
  • Qualitatively observed good diversity and fit to text descriptions
  • Conducted psychophysics evaluation with 3 perceivers
  • Annotators categorized each sample into 1 of 4 genres
  • Moûsai model achieved a good score, generating diverse and genre-specific music
  • Riffusion model perceived as more generic, often categorized as Pop

Sound quality

  • Evaluated sound quality of generated music
  • Model performs well with drum-like sounds in certain genres

Structure

  • Model can handle long-term structure, exceeding 1 minute
  • Generated samples exhibit rhythm, loops, riffs, and choruses
  • Increasing number of attention blocks can improve structure of songs
  • U-Net not large enough to learn long-term structure without attention blocks

Additional properties

  • Speed can be traded for quality by increasing the number of sampling steps
  • Trade-off between compression ratio and quality can be improved by using perceptually weighted loss functions
  • Text-audio binding works best with a classifier-free guidance (CFG) scale higher than 3.0

Future work

  • Increasing scale of data and model can improve quality
  • Suggest training with 50k-100k hours instead of 2.5k
  • Using larger pretrained language model for text embeddings can improve quality
  • More sophisticated diffusion samplers and distillation techniques can be used
  • Future modelling approaches include perceptual losses, mel-spectrograms, and DreamBooth-like models

Conclusion

  • Moûsai is a waveform-based audio generation method
  • Uses two diffusion models
  • First model compresses a magnitude-only spectrogram 64x
  • Second model generates a latent from noise while conditioning on text embeddings
  • Generates high-quality music in real time on a consumer GPU
  • Provides open-source libraries to facilitate future work
  • Evaluation results show model generates music that is correctly categorized into genres