Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models can be used for music generation
Music generation requires handling multiple aspects
Developed a cascading latent diffusion approach to generate high-quality stereo music
Targeting real-time on a single consumer GPU
Open-sourced the code and music samples for all models

Paper Content

Introduction

Music generation is a challenging problem
Recently, deep learning models have been used to explore audio generation
Existing models explore the use of recurrent neural networks, generative adversarial networks, autoencoders, and transformers
Diffusion models have been used in speech synthesis, but are still under-explored for music generation
Long-term structure, sound quality, diversity of music, and control of generation are challenges in the area of music generation
Moûsai is a text-conditional cascading diffusion model that tries to address all the challenges
Moûsai uses a custom two-stage cascading diffusion method
Moûsai can generate long-context 48kHz stereo music exceeding the minute mark
Moûsai uses an efficient 1D U-Net architecture for both stages of the cascade
Moûsai uses a diffusion magnitude autoencoder to compress the audio signal 64x

Related work

A common trend in the generative space is to first train a compression model on the input domain, then learn a generative model on top of the reduced representation
Auto-encoding and quantized auto-encoding are popular compression methods for images
Two popular directions are to learn a quantized representation or to use a compressed/downsampled representation
The cascading diffusion approach has not been attempted for audio generation
Our work follows ideas from the cascading diffusion approach, using a two-stage method that first compresses the audio and then generates the reduced representation while conditioning on a textual description

Preliminaries

Diffusion: generative models that learn to reverse a gradual noising process, turning noise into data
Latent diffusion: diffusion applied to a compressed (latent) representation rather than to the raw signal
U-Net: a convolutional encoder-decoder network with skip connections

Audio generation

Audio generation is a challenging task
Waveforms can be represented in different resolutions
Higher sample rates allow for more temporal resolution
Qualitative properties such as texture and pitch can be observed
Audio can be represented with mono, stereo, or surround sound
Models can be trained on single or multiple modalities

Diffusion

Employed v-objective diffusion as proposed by Salimans & Ho (2022).
Used DDIM sampler (Song et al., 2021) to turn noise into a new datapoint.
DDIM sampler denoises signal by repeated application of an equation.
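The repeated update can be illustrated with a minimal sketch in plain Python. It assumes the angular parameterization often paired with the v-objective (signal weight cos(σπ/2), noise weight sin(σπ/2)) and a linear schedule of noise levels; both are assumptions here, as are all function names. `make_oracle` is a hypothetical stand-in for a trained network: it knows the target signal, which makes it possible to check that the update rule lands on that target.

```python
import math
import random


def alpha_beta(sigma):
    """Map a noise level sigma in [0, 1] to signal/noise weights (cos, sin)."""
    phi = sigma * math.pi / 2
    return math.cos(phi), math.sin(phi)


def ddim_sample(predict_v, noise, num_steps=10):
    """Deterministic DDIM sampling for a v-objective model.

    predict_v(x, sigma) stands in for the trained network: it returns the
    estimated velocity v = alpha * eps - beta * x0 for the noisy input x.
    """
    # Assumed schedule: linearly spaced noise levels from 1 (noise) to 0 (data).
    sigmas = [1 - i / num_steps for i in range(num_steps + 1)]
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        alpha, beta = alpha_beta(sigma)
        alpha_next, beta_next = alpha_beta(sigma_next)
        v = predict_v(x, sigma)
        # Recover the current estimates of the clean signal and of the noise.
        x0_hat = [alpha * xi - beta * vi for xi, vi in zip(x, v)]
        eps_hat = [beta * xi + alpha * vi for xi, vi in zip(x, v)]
        # Re-noise the clean estimate to the next (lower) noise level.
        x = [alpha_next * a + beta_next * e for a, e in zip(x0_hat, eps_hat)]
    return x


def make_oracle(x0):
    """Toy 'model' that knows the target x0; used only to check the update rule."""
    def predict_v(x, sigma):
        alpha, beta = alpha_beta(sigma)
        eps = [(xi - alpha * ti) / beta for xi, ti in zip(x, x0)]
        return [alpha * e - beta * ti for e, ti in zip(eps, x0)]
    return predict_v


target = [0.5, -0.25, 0.75, 0.0]
rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in target]
sample = ddim_sample(make_oracle(target), start, num_steps=10)
```

With a perfect velocity prediction the sampler returns the target regardless of the starting noise; a real network only approximates this, which is why more steps tend to help.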

Latent diffusion

Audio is compressed into a smaller representation
Diffusion process is applied to the reduced latent space
Diffusion based autoencoder is proposed instead of a standard autoencoder
Increases representation power of decoding process and compressibility

U-Net

U-Nets were first proposed by Ronneberger et al. (2015)
Used for medical image segmentation, and since repurposed for multiple uses
Our proposed U-Net has little resemblance to the original work
Includes more modern convolutional blocks, a variety of attention blocks, conditioning blocks, and improved skip connections
Moûsai is composed of two independently trained models
First stage (DMAE) compresses the audio waveform 64x using a diffusion autoencoder
Second stage (latent text-to-audio diffusion) generates a novel latent while conditioning on text embeddings
Both diffusion models use the same efficient 1D U-Net architecture with varying configurations

1D U-Net

Used 1D U-Net architecture for autoencoding and latent diffusion
1D convolutional kernels are more efficient than 2D
Used variety of items at each resolution of U-Net: residual, modulation, inject, attention, cross attention

Diffusion magnitude-autoencoding (DMAE)

Diffusion autoencoders were introduced by Preechakul et al. (2022) as a way to condition the diffusion process on a compressed latent vector of the input.
Magnitude spectrograms are encoded into a latent vector using a 1D convolutional encoder.
The original waveform is reconstructed by decoding the latent using a diffusion model.
Phase is discarded to obtain higher compression ratios.
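To make "phase is discarded" concrete, here is a small sketch of a magnitude spectrogram in plain Python. The frame size, hop length, Hann window, and the naive DFT are illustrative assumptions chosen for readability, not the settings used in the paper; a practical implementation would use an FFT library. The function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Return |DFT| of one Hann-windowed frame for bins 0..N/2 (phase dropped)."""
    n = len(frame)
    # Periodic Hann window, a common choice for STFT frames.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        magnitudes.append(abs(coeff))  # keep the magnitude, discard the angle
    return magnitudes


def magnitude_spectrogram(signal, frame_size=32, hop=8):
    """Slide a window over the signal and stack the per-frame magnitude spectra."""
    starts = range(0, len(signal) - frame_size + 1, hop)
    return [magnitude_spectrum(signal[s:s + frame_size]) for s in starts]


# Two tones with the same frequency (bin 4) but different phase.
N = 32
sine = [math.sin(2 * math.pi * 4 * i / N) for i in range(N)]
cosine = [math.cos(2 * math.pi * 4 * i / N) for i in range(N)]
mag_sine = magnitude_spectrum(sine)
mag_cosine = magnitude_spectrum(cosine)
spec = magnitude_spectrogram(sine * 4)  # 128 samples -> several frames
```

The sine and cosine frames yield the same magnitudes even though their waveforms differ, so the encoder input carries no phase information and the diffusion decoder has to regenerate it when reconstructing the waveform.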

Latent text-to-audio diffusion

Latent diffusion is applied to the compressed space
V-objective diffusion is used with a 1D U-Net architecture
Text embedding is used to generate compressed latent
Cross attention blocks provide conditioning text embedding
Multiple attention blocks allow information to be shared over the entire latent

Text conditioning

Use pre-trained language model to generate text embeddings
Use classifier-free guidance with a learned mask applied on batch elements with a probability of 0.1
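The two sides of classifier-free guidance can be sketched as follows. This is a simplified illustration in plain Python: embeddings and model outputs are short lists, `mask_embedding` is a hypothetical stand-in for the learned mask, and the guidance formula is the standard one (extrapolating from the unconditional towards the conditional prediction), assumed here rather than taken from the text above.

```python
import random


def apply_cfg_mask(batch_embeddings, mask_embedding, drop_prob=0.1, rng=random):
    """Training side: swap each item's text embedding for the mask with drop_prob."""
    return [mask_embedding if rng.random() < drop_prob else emb
            for emb in batch_embeddings]


def guided_prediction(cond, uncond, scale):
    """Inference side: move from the unconditional towards the conditional output."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


rng = random.Random(0)
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [0.0, 0.0]  # hypothetical placeholder for the learned mask embedding

all_masked = apply_cfg_mask(batch, mask, drop_prob=1.0, rng=rng)
none_masked = apply_cfg_mask(batch, mask, drop_prob=0.0, rng=rng)
guided = guided_prediction(cond=[1.0, 2.0], uncond=[0.0, 1.0], scale=3.0)
```

A scale of 1.0 reproduces the conditional prediction and 0.0 the unconditional one; larger scales push the output further towards the text condition.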

Experimental setup

Dataset and training setup overview in Section 5.1
Implementation details in Section 5.2
Hardware requirements in Section 5.3

Dataset and training setup

Compiled a collection of 2,500 hours of stereo music
Autoencoder trained on random crops of length 2^18 samples (about 5.5s at 48kHz)
Text-conditional diffusion generation model trained on fixed crops of length 2^21 samples (about 44s at 48kHz)
Metadata used for textual description includes title, author, album, genre, and year of release
Metadata list shuffled and elements dropped with probability of 0.1
Metadata list concatenated with spaces or commas for robustness during inference
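The metadata steps above (shuffle, drop, concatenate) can be sketched in a few lines of Python. The function name, the example track, and the choice of picking one separator per description are assumptions made for illustration.

```python
import random


def build_text(metadata, drop_prob=0.1, rng=random):
    """Turn a metadata dict into a randomized text description."""
    values = [str(v) for v in metadata.values()]
    rng.shuffle(values)                                      # random field order
    kept = [v for v in values if rng.random() >= drop_prob]  # drop some fields
    separator = rng.choice([" ", ", "])                      # spaces or commas
    return separator.join(kept)


# Hypothetical example track, not taken from the dataset.
track = {
    "title": "Example Title",
    "author": "Example Artist",
    "album": "Example Album",
    "genre": "Jazz",
    "year": 1999,
}
text = build_text(track, drop_prob=0.0, rng=random.Random(0))
empty = build_text(track, drop_prob=1.0, rng=random.Random(0))
```

Randomizing order, content, and separators during training means the model does not depend on one fixed prompt layout, which is the robustness at inference that the last point refers to.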

Implementation details

Trained a 185M-parameter diffusion autoencoder with 7 nested U-Net blocks
No attention used to allow decoding of variable and possibly very long latents
Channel injection only happens at depth 4
Trained an 857M-parameter text-conditional generator with 6 nested U-Net blocks
Attention blocks enabled per depth as [0, 0, 1, 1, 1, 1], i.e. skipped in the first two blocks
Cross attention blocks used at all resolutions
AdamW optimizer used with learning rate of 10^-4, β1 = 0.95, β2 = 0.999, ε = 10^-6, and weight decay of 10^-3
Exponential moving average (EMA) used with β = 0.995 and power of 0.7
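One way to read the EMA settings is as a warm-up schedule in which the decay rises with the training step and is capped at β. The sketch below assumes that "power" is the exponent of such a schedule, as found in some open-source EMA helpers; this interpretation, the `inv_gamma` parameter, and the function names are assumptions, not details confirmed by the notes above.

```python
def ema_decay(step, beta=0.995, power=0.7, inv_gamma=1.0):
    """Warm-up schedule: the decay grows with the step count and is capped at beta."""
    value = 1.0 - (1.0 + step / inv_gamma) ** (-power)
    return min(max(value, 0.0), beta)


def ema_update(ema_weights, weights, decay):
    """Move the averaged weights a fraction (1 - decay) towards the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Decay at a few training steps: starts at 0 and saturates at beta.
decays = [ema_decay(step) for step in (0, 10, 1000, 100000)]
averaged = ema_update([0.0, 2.0], [1.0, 0.0], decay=0.5)
```

Early in training the average tracks the raw weights closely; once the cap is reached, each update keeps 99.5% of the previous average.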

Hardware requirements

Training of both models can be done on a single A100 GPU in 1 week using a batch size of 32.
Generating a novel audio clip of ∼88s takes less than ∼88s on a consumer GPU, i.e. faster than real time.

Results

Model generates long-context music from text descriptions
Most other models do not take text as input
Riffusion model is the only comparable model
Evaluated from multiple perspectives: genre diversity, relevance, sound quality, long-term structure
No perfect evaluation metric for music
Listen to samples for holistic impression

Diversity & text-to-music relevance

Conducted a listener test to illustrate diversity and text relevance of Moûsai
Composed a list of 40 text prompts spanning across 4 music genres
Generated 80 pieces of music, 2 for each prompt
Qualitatively observed good diversity and fit to text descriptions
Conducted psychophysics evaluation with 3 perceivers
Annotators categorized each sample into 1 of 4 genres
Moûsai model achieved a good score, generating diverse and genre-specific music
Riffusion model perceived as more generic, often categorized as Pop

Sound quality

Evaluated sound quality of generated music
Model performs well with drum-like sounds in certain genres

Structure

Model can handle long-term structure, exceeding 1 minute
Generated samples exhibit rhythm, loops, riffs, and choruses
Increasing number of attention blocks can improve structure of songs
U-Net not large enough to learn long-term structure without attention blocks

Additional properties

Trade-off between speed and quality: increasing the number of sampling steps improves quality at the cost of speed
Trade-off between compression ratio and quality can be improved by using perceptually weighted loss functions
Text-audio binding works best with a classifier-free guidance (CFG) scale higher than 3.0

Future work

Increasing scale of data and model can improve quality
Suggest training with 50k-100k hours instead of 2.5k
Using larger pretrained language model for text embeddings can improve quality
More sophisticated diffusion samplers and distillation techniques can be used
Future modelling approaches include perceptual losses, mel-spectrograms, and DreamBooth-like models

Conclusion

Moûsai is a waveform based audio generation method
Uses two diffusion models
First model compresses magnitude only spectrogram 64x
Second model generates latent from noise while conditioning on text embeddings
Generates high-quality music in realtime on a consumer GPU
Provides open-source libraries to facilitate future work
Evaluation results show model generates music that is correctly categorized into genres
