Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce MusicLM, a model that generates high-fidelity music from text descriptions
- MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task
- MusicLM generates music at 24 kHz that remains consistent over several minutes
- MusicLM outperforms previous systems in audio quality and adherence to the text description
- MusicLM can be conditioned on both text and a melody
- Release MusicCaps, a dataset of 5.5k music-text pairs with rich text descriptions
Paper Content
Introduction
- Conditional neural audio generation covers a wide range of applications
- Recent work has explored generating audio from sequence-wide, high-level captions
- Generating audio from such captions is limited to simple acoustic scenes
- Turning a single text caption into a rich audio sequence with long-term structure is an open challenge
- AudioLM is a framework for audio generation
- AudioLM learns to generate realistic audio from audio-only corpora
- Audio-text data is scarce compared to image data
- MusicLM is a model for generating high-fidelity music from text descriptions
- MusicLM leverages AudioLM’s multi-stage autoregressive modeling
- MusicLM is trained on a large dataset of unlabeled music
- MusicCaps is a new high-quality music caption dataset with 5.5k examples
- MusicLM outperforms previous systems in terms of quality and adherence to the caption
- MusicLM supports conditioning signals beyond text
Background and related work
- Transformer-based autoregressive models and U-Net-based diffusion models are the current state-of-the-art in generative modeling.
- This section reviews related work with a focus on autoregressive generative models operating on discrete tokens, similar to MusicLM.
Quantization
- Autoregressive modeling is a powerful approach for natural language processing and image/video generation
- Quantization is a key component for autoregressive models for continuous signals
- VQ-VAEs provide compact, discrete representations with high-fidelity reconstruction
- SoundStream is a neural audio codec that uses RVQ to compress audio at low bitrates
- RVQ avoids exponential blowup in codebook size as bitrate increases
- SoundStream was extended by EnCodec to higher bitrates and stereophonic audio
Generative models for audio
- Jukebox and PerceiverAR have both tackled the problem of generating high-quality audio with long-term consistency
- AudioLM addresses the trade-off between coherence and high-quality synthesis by relying on a hierarchical tokenization and generation scheme
- MusicLM builds on top of AudioLM and adds the ability to condition the generation process on a descriptive text, melody, and a variety of long music sequences
Conditioned audio generation
- Generating audio from text has been tackled by several works
- DiffSound and AudioGen use text encoders and decoders to predict audio
- Mubert and Riffusion are used as baselines
- Symbolic representations of music can also be used to drive the generative process
- MusicLM enables a more natural and intuitive way of providing a conditioning signal
Text-conditioned image generation
- Text-conditioned image generation models have made significant progress in quality.
- DALLE 2 is the closest to the approach in the paper.
- The paper uses a joint music-text embedding model for text encoding.
- The decoder is based on AudioLM.
- The model is trained on an audio-only dataset.
Joint embedding models for music and text
- MuLan is a music-text joint embedding model
- MuLan maps two modalities to a shared embedding space of 128 dimensions
- MuLan is trained on pairs of music clips and their corresponding text annotations
- MuLan can link music to unconstrained natural language descriptions
- MuLan is applicable for retrieval or zero-shot music tagging
Method
- MusicLM is a computer science paper
- MusicLM has two components: audio representations and text-conditioned music generation
- Audio representations are described in Section 3.1
- Text-conditioned music generation is described in Section 3.2
Representation and tokenization of audio and text
- We use three models to extract audio representations for conditional autoregressive music generation
- SoundStream provides acoustic tokens for high-fidelity synthesis
- w2v-BERT provides semantic tokens for long-term coherent generation
- MuLan provides audio and text embeddings for sequence-to-sequence modeling
- SoundStream model is 24 kHz monophonic audio with a striding factor of 480
- Quantization of SoundStream embeddings is learned during training
- w2v-BERT model has 600M parameters and 25 semantic tokens per second
- MuLan audio embeddings are quantized and averaged for longer audio sequences
- MuLan text embeddings are used for conditioning during inference
Hierarchical modeling of audio representations
- Combines discrete audio representations with AudioLM to generate text-conditioned music
- Hierarchical sequence-to-sequence modeling task, each stage modeled autoregressively by a separate decoder-only Transformer
- First stage is semantic modeling, learns mapping from MuLan audio tokens to semantic tokens
- Second stage is acoustic modeling, acoustic tokens predicted conditioned on MuLan audio tokens and semantic tokens
- AudioLM proposed to split acoustic modeling stage into coarse and fine stages
Experimental setup
Models
- Decoder-only Transformers used for AudioLM semantic and acoustic stages
- 24 layers, 16 attention heads, 1024 embedding dimension, 4096 feed-forward layers, 0.1 dropout, relative positional embeddings
- 430M parameters per stage
Training and inference
- MuLan is a pretrained and frozen model that requires audio-only data for training.
- The other components of MusicLM are trained on the Free Music Archive (FMA) dataset.
- Multiple passes are used to train each stage of MusicLM.
- 30 and 10-second random crops of the target audio are used for the semantic and acoustic stages respectively.
- AudioLM is trained on 3-second crops.
- Temperature sampling is used for autoregressive sampling in all stages.
Evaluation dataset
- MusicLM is evaluated using MusicCaps, a publicly available dataset of 5.5k music clips.
- Each clip is paired with a text description written by professional musicians.
- MusicCaps includes free-text captions and a list of music aspects.
- MusicCaps is genre-balanced and contains audio clips from AudioSet.
Metrics
- Compute different metrics to evaluate MusicLM
- Fréchet Audio Distance (FAD) correlates well with human perception
- Models with low FAD score generate plausible audio
- Many-to-many relationship between text descriptions and music clips
- Use LEAF classifier to compute class predictions for generated and reference music
- Measure KL divergence between probability distributions of class predictions
- MuLan used to quantify similarity between music-text pairs
- Use A-vs-B human rating task to evaluate adherence to text description
- Training data memorization studied using Carlini et al. (2022) methodology
- Compute histogram of semantic token counts and use Sinkhorn algorithm to find matching cost
Results
- MusicLM is compared to two recent baselines for music generation from descriptive text
- MusicLM is able to generate high-quality music comparable to Mubert
- MusicLM is able to capture more information from the text descriptions compared to the baselines
- MusicLM is preferred over both baselines in a human listening test
- MusicLM is able to capture fine-grained information from the rich free-text captions of MusicCaps
- Decoupling semantic modeling from acoustic modeling is useful
- Audio tokens capture information about genre, rhythmical properties, and part of the main melody
- Semantic tokens facilitate the adherence to the text description
- Memorization analysis shows that exact matches remain very small
Extensions
- MusicLM can generate music based on both a text description and a melody
- Synthetic dataset composed of audio pairs with matching melodies but different acoustics
- Extract melody conditioning for MusicLM by quantizing the melody embeddings
- Generate long audio sequences with a stride of 15 seconds and changing text description over time
Conclusions
- MusicLM is a text-conditioned generative model that produces high-quality music
- MusicLM outperforms baselines on MusicCaps dataset
- Limitations of MusicLM include misunderstanding negations and not adhering to temporal ordering
- Future work may focus on lyrics generation, text conditioning and vocal quality
- MusicLM raises questions about appropriateness for music generation for cultures underrepresented in the training data
- Study of memorization found only a tiny fraction of examples were memorized exactly
- Human listener study found raters had a decisive model preference in all cases except Mubert vs. Riffusion
- Evaluation of generated samples using captions from MusicCaps dataset