Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce MusicLM, a model that generates high-fidelity music from text descriptions
  • MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task
  • MusicLM generates music at 24 kHz that remains consistent over several minutes
  • MusicLM outperforms previous systems in audio quality and adherence to the text description
  • MusicLM can be conditioned on both text and a melody
  • Release MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions

Paper Content

Introduction

  • Conditional neural audio generation covers a wide range of applications
  • Recent work has explored generating audio from sequence-wide, high-level captions
  • Generating audio from such captions is limited to simple acoustic scenes
  • Turning a single text caption into a rich audio sequence with long-term structure is an open challenge
  • AudioLM is a framework for audio generation
  • AudioLM learns to generate realistic audio from audio-only corpora
  • Audio-text data is scarce compared to image data
  • MusicLM is a model for generating high-fidelity music from text descriptions
  • MusicLM leverages AudioLM’s multi-stage autoregressive modeling
  • MusicLM is trained on a large dataset of unlabeled music
  • MusicCaps is a new high-quality music caption dataset with 5.5k examples
  • MusicLM outperforms previous systems in terms of quality and adherence to the caption
  • MusicLM supports conditioning signals beyond text

Background and related work

  • Transformer-based autoregressive models and U-Net-based diffusion models are the current state-of-the-art in generative modeling.
  • This section reviews related work with a focus on autoregressive generative models operating on discrete tokens, similar to MusicLM.

Quantization

  • Autoregressive modeling is a powerful approach for natural language processing and image/video generation
  • Quantization is a key component for autoregressive models for continuous signals
  • VQ-VAEs provide compact, discrete representations with high-fidelity reconstruction
  • SoundStream is a neural audio codec that uses RVQ to compress audio at low bitrates
  • RVQ avoids exponential blowup in codebook size as bitrate increases
  • SoundStream was extended by EnCodec to higher bitrates and stereophonic audio
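
The residual vector quantization (RVQ) idea above can be sketched in a few lines. This is a minimal illustration, not SoundStream's implementation: the codebooks are random placeholders rather than learned ones, and the sizes (`dim`, `num_quantizers`, `codebook_size`) are made-up values. Each stage quantizes the residual left by the previous stage, so the number of stored vectors grows linearly with the number of stages while the number of representable codes grows exponentially.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize x with a cascade of codebooks; return one index per stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:                          # cb has shape (codebook_size, dim)
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))               # nearest code vector
        indices.append(idx)
        residual = residual - cb[idx]             # next stage quantizes what is left
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vector from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
dim, num_quantizers, codebook_size = 8, 4, 16     # illustrative sizes (assumptions)
# Random codebooks with shrinking scale stand in for learned ones.
codebooks = [rng.normal(scale=0.5 ** q, size=(codebook_size, dim))
             for q in range(num_quantizers)]

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)

stored_vectors = num_quantizers * codebook_size            # linear in the number of stages
distinct_codes = codebook_size ** num_quantizers           # exponential in the number of stages
bits_per_frame = num_quantizers * int(np.log2(codebook_size))
```

A single flat codebook with the same `bits_per_frame` would need `distinct_codes` entries, which is the blowup RVQ avoids.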

Generative models for audio

  • Jukebox and PerceiverAR have both tackled the problem of generating high-quality audio with long-term consistency
  • AudioLM addresses the trade-off between coherence and high-quality synthesis by relying on a hierarchical tokenization and generation scheme
  • MusicLM builds on top of AudioLM by conditioning the generation process on a descriptive text, extending that conditioning to other signals such as melody, and modeling a large variety of long music sequences

Conditioned audio generation

  • Generating audio from text has been tackled by several works
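  • DiffSound and AudioGen each pair a text encoder with a decoder that predicts a discrete audio representation (quantized mel spectrogram features and EnCodec codes, respectively)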
  • Mubert and Riffusion are used as baselines
  • Symbolic representations of music can also be used to drive the generative process
  • MusicLM enables a more natural and intuitive way of providing a conditioning signal

Text-conditioned image generation

  • Text-conditioned image generation models have made significant progress in quality.
  • DALL·E 2 is the closest to the approach in the paper.
  • The paper uses a joint music-text embedding model for text encoding.
  • The decoder is based on AudioLM.
  • The model is trained on an audio-only dataset.

Joint embedding models for music and text

  • MuLan is a music-text joint embedding model
  • MuLan maps two modalities to a shared embedding space of 128 dimensions
  • MuLan is trained on pairs of music clips and their corresponding text annotations
  • MuLan can link music to unconstrained natural language descriptions
  • MuLan is applicable for retrieval or zero-shot music tagging
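
Zero-shot tagging in a shared embedding space reduces to a nearest-neighbor lookup by cosine similarity. The sketch below assumes placeholder vectors: it does not call MuLan, and the tag names, the random embeddings, and the helper `zero_shot_tag` are all hypothetical. Only the embedding dimension of 128 comes from the text above.

```python
import numpy as np

EMBED_DIM = 128  # shared embedding dimension stated above

def l2_normalize(v):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_tag(audio_embedding, tag_embeddings, tags):
    """Rank tags by cosine similarity to the audio embedding (hypothetical helper)."""
    sims = l2_normalize(tag_embeddings) @ l2_normalize(audio_embedding)
    order = np.argsort(-sims)                      # most similar first
    return [(tags[i], float(sims[i])) for i in order]

rng = np.random.default_rng(0)
tags = ["jazz", "techno", "string quartet"]
# Placeholder vectors; a real system would obtain these from the text and audio towers.
tag_embeddings = rng.normal(size=(len(tags), EMBED_DIM))
# Simulate a clip whose audio embedding lies close to the "techno" text embedding.
audio_embedding = tag_embeddings[1] + 0.1 * rng.normal(size=EMBED_DIM)

ranking = zero_shot_tag(audio_embedding, tag_embeddings, tags)
```

Retrieval works the same way with the roles swapped: embed a text query and rank a collection of audio embeddings against it.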

Method

  • This section describes MusicLM in two parts: audio representations and text-conditioned music generation
  • Audio representations are described in Section 3.1
  • Text-conditioned music generation is described in Section 3.2

Representation and tokenization of audio and text

  • We use three models to extract audio representations for conditional autoregressive music generation
  • SoundStream provides acoustic tokens for high-fidelity synthesis
  • w2v-BERT provides semantic tokens for long-term coherent generation
  • MuLan provides audio and text embeddings for sequence-to-sequence modeling
  • The SoundStream model operates on 24 kHz monophonic audio with a striding factor of 480, which gives 50 embeddings per second
  • Quantization of SoundStream embeddings is learned during training
  • The w2v-BERT model has 600M parameters and yields 25 semantic tokens per second
  • For longer audio sequences, MuLan audio embeddings are averaged over windows of the sequence and then quantized into tokens
  • MuLan text embeddings are used for conditioning during inference
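
The token rates implied by these figures follow from simple arithmetic. The sketch below derives them from the sample rate, stride, and semantic token rate given above. The RVQ depth (`num_quantizers`) and bits per code passed in the example call are illustrative assumptions that do not appear in this summary, and `token_budget` is a hypothetical helper.

```python
SAMPLE_RATE_HZ = 24_000             # from the text above
STRIDE = 480                        # SoundStream striding factor, from the text above
SEMANTIC_TOKENS_PER_SECOND = 25     # w2v-BERT semantic token rate, from the text above

# One SoundStream embedding is produced per STRIDE input samples.
acoustic_frames_per_second = SAMPLE_RATE_HZ // STRIDE

def token_budget(seconds, num_quantizers, bits_per_code):
    """Count tokens for a clip; num_quantizers and bits_per_code are caller-supplied assumptions."""
    frames = acoustic_frames_per_second * seconds
    return {
        "semantic_tokens": SEMANTIC_TOKENS_PER_SECOND * seconds,
        "acoustic_frames": frames,
        "acoustic_tokens": frames * num_quantizers,   # one token per RVQ level per frame
        "bitrate_bps": acoustic_frames_per_second * num_quantizers * bits_per_code,
    }

# Example with assumed RVQ settings (12 levels, 10 bits each) for a 10-second clip.
budget = token_budget(seconds=10, num_quantizers=12, bits_per_code=10)
```

The acoustic sequence is much longer than the semantic one for the same clip, which is one motivation for modeling the two in separate stages.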

Hierarchical modeling of audio representations

  • Combines discrete audio representations with AudioLM to generate text-conditioned music
  • Hierarchical sequence-to-sequence modeling task, each stage modeled autoregressively by a separate decoder-only Transformer
  • First stage is semantic modeling, learns mapping from MuLan audio tokens to semantic tokens
  • Second stage is acoustic modeling, acoustic tokens predicted conditioned on MuLan audio tokens and semantic tokens
  • As proposed in AudioLM, the acoustic modeling stage is further split into coarse and fine stages to avoid long token sequences
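
The data flow of the two stages can be sketched as two autoregressive decoding passes, each conditioned on a prefix. This is only an illustration of the conditioning order: `toy_next_token` is a hypothetical deterministic stand-in for a trained decoder-only Transformer, `VOCAB_SIZE` is an assumed value, and the sketch emits one acoustic token per frame, ignoring RVQ levels and the coarse/fine split. At inference, the MuLan tokens would come from the text prompt rather than from audio.

```python
from typing import Callable, List, Sequence

VOCAB_SIZE = 1024     # assumed vocabulary size, for illustration only
SEMANTIC_RATE = 25    # semantic tokens per second
ACOUSTIC_RATE = 50    # acoustic frames per second (24 kHz / stride 480)

def toy_next_token(context: Sequence[int]) -> int:
    """Hypothetical stand-in for a trained model: a fixed function of the context."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE

def autoregressive_decode(prefix: Sequence[int], length: int,
                          next_token: Callable[[Sequence[int]], int]) -> List[int]:
    """Generate `length` tokens, each conditioned on the prefix and all earlier outputs."""
    generated: List[int] = []
    for _ in range(length):
        generated.append(next_token(list(prefix) + generated))
    return generated

def generate(mulan_tokens: Sequence[int], seconds: int):
    # Stage 1 (semantic modeling): MuLan tokens -> semantic tokens.
    semantic = autoregressive_decode(mulan_tokens, SEMANTIC_RATE * seconds, toy_next_token)
    # Stage 2 (acoustic modeling): MuLan tokens + semantic tokens -> acoustic tokens.
    acoustic = autoregressive_decode(list(mulan_tokens) + semantic,
                                     ACOUSTIC_RATE * seconds, toy_next_token)
    return semantic, acoustic

semantic, acoustic = generate(mulan_tokens=[1, 2, 3], seconds=2)
```

In the real system each stage is a separate model, so the two calls would use different `next_token` functions.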

Experimental setup

Models

  • Decoder-only Transformers used for AudioLM semantic and acoustic stages
  • 24 layers, 16 attention heads, an embedding dimension of 1024, feed-forward layers of dimensionality 4096, dropout of 0.1, relative positional embeddings
  • 430M parameters per stage

Training and inference

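  • MuLan is pretrained and frozen, so the other components of MusicLM need only audio-only data for training.
  • SoundStream and w2v-BERT are trained on the Free Music Archive (FMA) dataset; the tokenizers and the autoregressive semantic and acoustic stages are trained on a dataset of five million audio clips (280k hours of music at 24 kHz).
  • Each stage of MusicLM is trained with multiple passes over the training data.
  • Random crops of 30 and 10 seconds of the target audio are used for the semantic and acoustic stages, respectively.
  • The AudioLM fine acoustic modeling stage is trained on 3-second crops.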
  • Temperature sampling is used for autoregressive sampling in all stages.
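
Temperature sampling divides the logits by a temperature before the softmax: values below 1 sharpen the distribution, values above 1 flatten it. The sketch below is a generic NumPy illustration; the logits and the two temperatures are made-up values, not the settings used in the paper.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature); also return the probabilities."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                  # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                    # illustrative next-token logits
token_sharp, sharp = sample_with_temperature(logits, 0.5, rng)   # low temperature
token_flat, flat = sample_with_temperature(logits, 2.0, rng)     # high temperature
```

With the lower temperature, more probability mass concentrates on the highest-scoring token, which trades diversity for predictability.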

Evaluation dataset

  • MusicLM is evaluated using MusicCaps, a publicly available dataset of 5.5k music clips.
  • Each clip is paired with a text description written by professional musicians.
  • MusicCaps includes free-text captions and a list of music aspects.
  • MusicCaps is genre-balanced and contains audio clips from AudioSet.

Metrics

  • Compute different metrics to evaluate MusicLM
  • Fréchet Audio Distance (FAD) correlates well with human perception
  • Models with low FAD score generate plausible audio
  • Many-to-many relationship between text descriptions and music clips
  • Use LEAF classifier to compute class predictions for generated and reference music
  • Measure KL divergence between probability distributions of class predictions
  • MuLan used to quantify similarity between music-text pairs
  • Use A-vs-B human rating task to evaluate adherence to text description
  • Training data memorization studied using Carlini et al. (2022) methodology
  • Compute histogram of semantic token counts and use Sinkhorn algorithm to find matching cost
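
Two of the metrics above have compact closed forms. The sketch below assumes the standard definitions: a Fréchet distance between Gaussians fitted to two sets of embeddings, and a KL divergence between two class-probability vectors. It is not the evaluation code of the paper: the embeddings here are random placeholders rather than outputs of an audio model, and the class predictions are assumed to be normalized distributions.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets of shape (n, dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues of cov_a @ cov_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors; eps guards against log(0)."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 4))       # placeholder "reference" embeddings
shifted = reference + 3.0                   # same spread, different mean

fd_same = frechet_distance(reference, reference)
fd_shifted = frechet_distance(reference, shifted)
kl_same = kl_divergence([0.5, 0.5], [0.5, 0.5])
kl_diff = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

Identical inputs give a distance of zero up to floating-point error, and shifting every embedding by 3 in each of 4 dimensions adds the squared mean difference of 36.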

Results

  • MusicLM is compared to two recent baselines for music generation from descriptive text, Mubert and Riffusion
  • MusicLM is able to generate high-quality music comparable to Mubert
  • MusicLM is able to capture more information from the text descriptions compared to the baselines
  • MusicLM is preferred over both baselines in a human listening test
  • MusicLM is able to capture fine-grained information from the rich free-text captions of MusicCaps
  • Decoupling semantic modeling from acoustic modeling is useful
  • Audio tokens capture information about genre, rhythmical properties, and part of the main melody
  • Semantic tokens facilitate the adherence to the text description
  • Memorization analysis shows that exact matches remain very small
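
The approximate-match side of the memorization analysis compares histograms of semantic token counts with a matching cost found by the Sinkhorn algorithm. The sketch below is a generic entropy-regularized optimal transport routine under stated assumptions: the four-token vocabulary, the cost matrix (absolute difference of token ids), and the regularization strength are all placeholders, not the paper's setup.

```python
import numpy as np

def sinkhorn_cost(hist_a, hist_b, cost, reg=0.1, num_iters=200):
    """Entropy-regularized transport cost between two histograms over the same vocabulary."""
    a = hist_a / hist_a.sum()               # normalize counts to probability mass
    b = hist_b / hist_b.sum()
    kernel = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(num_iters):              # alternate scaling to match both marginals
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost))

vocab = np.arange(4)
# Placeholder cost: distance between token ids. A real setup would need a
# meaningful notion of distance between semantic tokens.
cost = np.abs(vocab[:, None] - vocab[None, :]).astype(np.float64)

counts = np.array([10, 0, 0, 0])
near = np.array([0, 10, 0, 0])
far = np.array([0, 0, 0, 10])
mixed = np.array([5, 5, 0, 0])

cost_near = sinkhorn_cost(counts, near, cost)
cost_far = sinkhorn_cost(counts, far, cost)
cost_self = sinkhorn_cost(mixed, mixed, cost)
```

A low matching cost between a generated clip and a training clip would flag a candidate approximate match for closer inspection.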

Extensions

  • MusicLM can generate music based on both a text description and a melody
  • Synthetic dataset composed of audio pairs with matching melodies but different acoustics
  • Extract melody conditioning for MusicLM by quantizing the melody embeddings
  • Generate long audio sequences with a stride of 15 seconds and changing text description over time
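
Long generation with a 15-second stride and a changing description can be sketched as a loop that extends the token sequence one chunk at a time. The sketch assumes that each step is conditioned on the most recent 15 seconds of tokens as a prefix. `toy_continue` is a hypothetical stand-in for the model, and the prompts are made-up examples.

```python
from typing import Callable, List

SEMANTIC_RATE = 25     # semantic tokens per second
STRIDE_SECONDS = 15    # stride stated above

def generate_long(prompts: List[str],
                  continue_fn: Callable[[str, List[int], int], List[int]]) -> List[int]:
    """Generate one stride-length chunk per prompt, each conditioned on the previous chunk."""
    chunk = STRIDE_SECONDS * SEMANTIC_RATE
    tokens: List[int] = []
    for prompt in prompts:
        prefix = tokens[-chunk:]                    # last 15 s generated so far (empty at start)
        tokens.extend(continue_fn(prompt, prefix, chunk))
    return tokens

prefix_lengths: List[int] = []

def toy_continue(prompt: str, prefix: List[int], length: int) -> List[int]:
    """Hypothetical model stub: records the prefix length and emits a constant token."""
    prefix_lengths.append(len(prefix))
    return [len(prompt) % 1024] * length

story = ["time to meditate", "time to wake up", "time to run"]
tokens = generate_long(story, toy_continue)
```

Three prompts yield 45 seconds of semantic tokens, and every chunk after the first sees a full 15-second prefix.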

Conclusions

  • MusicLM is a text-conditioned generative model that produces high-quality music
  • MusicLM outperforms baselines on the MusicCaps dataset
  • Limitations of MusicLM include misunderstanding negations and not adhering to temporal ordering
  • Future work may focus on lyrics generation and on improving text conditioning and vocal quality
  • MusicLM raises the question of whether its generations are appropriate for cultures underrepresented in the training data
  • Study of memorization found only a tiny fraction of examples were memorized exactly
  • Human listener study found raters had a decisive model preference in all cases except Mubert vs. Riffusion
  • Evaluation of generated samples using captions from MusicCaps dataset