Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models can be used for music generation
- Music generation requires handling multiple aspects
- Developed a cascading latent diffusion approach to generate high-quality stereo music
- Targeting real-time on a single consumer GPU
- Open-sourced music samples, codes and all music samples for all models
Paper Content
Introduction
- Music generation is a challenging problem
- Recently, deep learning models have been used to explore audio generation
- Existing models explore the use of recursive neural networks, adversarial generative networks, autoencoders, and transformers
- Diffusion models have been used in speech synthesis, but are still under-explored for music generation
- Long-term structure, sound quality, diversity of music, and control of generation are challenges in the area of music generation
- Moûsai is a text-conditional cascading diffusion model that tries to address all the challenges
- Moûsai uses a custom two-stage cascading diffusion method
- Moûsai can generate long-context 48kHz stereo music exceeding the minute mark
- Moûsai uses an efficient 1D U-Net architecture for both stages of the cascade
- Moûsai uses a diffusion magnitude autoencoder to compress the audio signal 64x
Related work
- Common trend in generative space is to train a model on input domain and learn a generative model on top of reduced representation
- Auto-encoding and quantized auto-encoding are popular compression methods for images
- Two popular directions in generative space are to learn a quantized representation or use a compressed/downsampled representation
- Cascading diffusion approach has not been attempted for audio generation
- Our work follows ideas from cascading diffusion approach, using a two-stage method to compress audio and generate reduced representation while conditioning on a textual description
Preliminaries
- Diffusion: process of spreading information or resources
- Latent Diffusion: process of spreading information or resources in a hidden way
- U-Net: a type of convolutional neural network
Audio generation
- Audio generation is a challenging task
- Waveforms can be represented in different resolutions
- Higher sample rates allow for more temporal resolution
- Qualitative properties such as texture and pitch can be observed
- Audio can be represented with mono, stereo, or surround sound
- Models can be trained on single or multiple modalities
Diffusion
- Employed v v v-objective diffusion as proposed by Salimans & Ho (2022).
- Used DDIM sampler (Song et al., 2021) to turn noise into a new datapoint.
- DDIM sampler denoises signal by repeated application of an equation.
Latent diffusion
- Audio is compressed into a smaller representation
- Diffusion process is applied to the reduced latent space
- Diffusion based autoencoder is proposed instead of a standard autoencoder
- Increases representation power of decoding process and compressibility
U-net
- U-Nets were first proposed by Ronneberger et al. (2015)
- Used for medial image segmentation, and since repurposed for multiple uses
- Our proposed U-Net has little resemblance to the original work
- Includes more modern convolutional blocks, a variety of attention blocks, conditioning blocks, and improved skip connections
- Moûsai composed of two independently trained models
- First stage (DMAE) compresses audio waveform 64x using a diffusion autoencoder
- Second stage (latent text-to-audio diffusion) generates novel latent space while conditioning on text embeddings
- Both diffusion models use same efficient 1D U-Net architecture with varying configurations
1d u-net
- Used 1D U-Net architecture for autoencoding and latent diffusion
- 1D convolutional kernels are more efficient than 2D
- Used variety of items at each resolution of U-Net: residual, modulation, inject, attention, cross attention
Diffusion magnitude-autoencoding (dmae)
- Diffusion autoencoders were introduced by Preechakul et al. (2022) as a way to condition the diffusion process on a compressed latent vector of the input.
- Magnitude spectrograms are encoded into a latent vector using a 1D convolutional encoder.
- The original waveform is reconstructed by decoding the latent using a diffusion model.
- Phase is discarded to obtain higher compression ratios.
Latent text-to-audio diffusion
- Latent diffusion is applied to the compressed space
- V-objective diffusion is used with a 1D U-Net architecture
- Text embedding is used to generate compressed latent
- Cross attention blocks provide conditioning text embedding
- Multiple attention blocks allow information to be shared over the entire latent
Text conditioning
- Use pre-trained language model to generate text embeddings
- Use classifier-free guidance with a learned mask applied on batch elements with a probability of 0.1
Experimental setup
- Dataset and training setup overview in Section 5.1
- Implementation details in Section 5.2
- Hardware requirements in Section 5.3
Dataset and training setup
- Compiled a collection of 2,500 hours of stereo music
- Autoencoder trained on random crops of length 2 18
- Text-conditional diffusion generation model trained on fixed crops of length 2 21
- Metadata used for textual description includes title, author, album, genre, and year of release
- Metadata list shuffled and elements dropped with probability of 0.1
- Metadata list concatenated with spaces or commas for robustness during inference
Implementation details
- Trained a 185M-parameter diffusion autoencoder with 7 nested U-Net blocks
- No attention used to allow decoding of variable and possibly very long latents
- Channel injection only happens at depth 4
- Trained a 857M text-conditional generator with 6 nested U-Net blocks
- Attention blocks used at depths 0, 0, 1, 1, 1, 1
- Cross attention blocks used at all resolutions
- AdamW optimizer used with learning rate of 10-4, β 1 = 0.95, β 2 = 0.999, = 10-6, and weight decay of 10-3
- Exponential moving average (EMA) used with β = 0.995 and power of 0.7
Hardware requirements
- Training of both models can be done on a single A100 GPU in 1 week using a batch size of 32.
- Inference of a novel audio source of ∼88s can be done in less than ∼88s using a consumer GPU.
Results
- Model generates long-context music from text descriptions
- Most other models do not take text as input
- Riffusion model is the only comparable model
- Evaluated from multiple perspectives: genre diversity, relevance, sound quality, long-term structure
- No perfect evaluation metric for music
- Listen to samples for holistic impression
Diversity & text-to-music relevance
- Conducted a listener test to illustrate diversity and text relevance of Moûsai
- Composed a list of 40 text prompts spanning across 4 music genres
- Generated 80 pieces of music, 2 for each prompt
- Qualitatively observed good diversity and fit to text descriptions
- Conducted psychophysics evaluation with 3 perceivers
- Annotators categorized each sample into 1 of 4 genres
- Moûsai model achieved a good score, generating diverse and genre-specific music
- Riffusion model perceived as more generic, often categorized as Pop
Sound quality
- Evaluated sound quality of generated music
- Model performs well with drum-like sounds in certain genres
Structure
- Model can handle long-term structure, exceeding 1 minute
- Generated samples exhibit rhythm, loops, riffs, and choruses
- Increasing number of attention blocks can improve structure of songs
- U-Net not large enough to learn long-term structure without attention blocks
Additional properties
- Trade-off between speed and quality can be improved by increasing the number of sampling steps
- Trade-off between compression ratio and quality can be improved by using perceptually weighted loss functions
- Text-audio binding works best with CFG higher than 3.0
Future work
- Increasing scale of data and model can improve quality
- Suggest training with 50k-100k hours instead of 2.5k
- Using larger pretrained language model for text embeddings can improve quality
- More sophisticated diffusion samplers and distillation techniques can be used
- Future modelling approaches include perceptual losses, mel-spectrograms, and DreamBooth-like models
Conclusion
- Moûsai is a waveform based audio generation method
- Uses two diffusion models
- First model compresses magnitude only spectrogram 64x
- Second model generates latent from noise while conditioning on text embeddings
- Generates high-quality music in realtime on a consumer GPU
- Provides open-source libraries to facilitate future work
- Evaluation results show model generates music that is correctly categorized into genres