Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Combines mel spectrograms, diffusion models, and neural vocoders to synthesize long-context, high-fidelity music
  • Synthesizes 190 seconds of stereo music at 44.1 kHz without concatenative synthesis, cascading architectures, or compression techniques
  • First work to successfully employ a diffusion-based model for synthesizing long music samples at high sample rates
  • Demo and code available online

Paper Content

Introduction

  • Music is a universal language that connects people from diverse cultures.
  • Researchers have been investigating whether computers can capture the creative process behind music creation.
  • Generative modeling has seen significant growth with various techniques.
  • These techniques have achieved human-level performance in tasks such as image generation, speech generation, and text generation.
  • Music synthesis is a challenging task due to the high dimensionality of audio signals.
  • Time-Frequency (TF) representations, such as mel spectrograms, provide a powerful and intuitive way to represent features in audio signals.
  • Autoregressive models and GANs are popular choices for music synthesis, but they each have their own challenges.
  • Diffusion-based models offer fast inference, a simple training procedure, and have recently outperformed GANs in terms of quality.
  • This paper proposes a novel approach for music synthesis using mel spectrograms.
  • The proposed method can synthesize minutes of high-fidelity music at a high sample rate.
  • The proposed method can also be used to solve other audio tasks, such as audio inpainting and style transfer.

Background

  • Audio signals have high dimensionality, making them difficult to represent
  • Audio signals need to be discretized into a large number of samples, which requires a high sample rate
  • For music at CD quality, the sample rate is typically 44.1 kHz
  • To address the challenges of synthesizing long audio samples, a lower-dimensional representation is used to capture important musical features while minimizing computational complexity
  • An expressive but efficient generative model is also necessary
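
To make the dimensionality point concrete, the sketch below counts the raw values in a waveform of the length reported in the abstract (190 seconds, stereo, 44.1 kHz). The function name is illustrative only and does not come from the paper's code.

```python
SAMPLE_RATE_HZ = 44_100   # CD-quality sample rate
DURATION_S = 190          # clip length reported in the abstract
CHANNELS = 2              # stereo

def raw_sample_count(duration_s, sample_rate_hz, channels):
    """Number of scalar values in an uncompressed waveform."""
    return duration_s * sample_rate_hz * channels

total = raw_sample_count(DURATION_S, SAMPLE_RATE_HZ, CHANNELS)
print(f"{total:,} values for {DURATION_S} s of stereo audio")
```

A generative model working directly on the waveform would have to handle a sequence of that size, which is the motivation for a lower-dimensional representation.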

Mel spectrograms

  • Mel spectrograms are a popular audio representation used in tasks such as speech synthesis, voice conversion, and music synthesis.
  • Neural vocoders have been developed to approximate both the magnitude and phase from the mel spectrogram.
  • An alternative is to reconstruct only the magnitude spectrogram and approximate the phase using traditional methods.
  • Two losses are used to reconstruct the magnitude spectrogram: Spectral Convergence Loss and Log-Magnitude Loss.
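
The two losses named above can be sketched in NumPy as follows. This uses the commonly cited definitions (Frobenius-norm error relative to the target, and mean absolute error between log-magnitudes); the exact normalization and the `eps` value used by the authors are assumptions here.

```python
import numpy as np

def spectral_convergence_loss(mag_true, mag_pred):
    """Frobenius norm of the error, relative to the Frobenius norm of the target."""
    return np.linalg.norm(mag_true - mag_pred) / np.linalg.norm(mag_true)

def log_magnitude_loss(mag_true, mag_pred, eps=1e-7):
    """Mean absolute difference between log-magnitudes; eps avoids log(0)."""
    return np.mean(np.abs(np.log(mag_true + eps) - np.log(mag_pred + eps)))

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 1.0, size=(1025, 64))   # (frequency bins, time frames)
estimate = target * 1.1                           # a uniformly 10% too loud estimate

sc = spectral_convergence_loss(target, estimate)
lm = log_magnitude_loss(target, estimate)
print(sc, lm)
```

Spectral convergence emphasizes large-magnitude components, while the log-magnitude term gives more weight to quiet ones, which is why the two are typically used together.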

Diffusion

  • Diffusion models are Markovian Hierarchical Variational Autoencoders
  • Diffusion models add random noise to data and learn to synthesize data samples from noise
  • The reparameterization trick allows sampling from the forward distribution at any arbitrary timestep
  • Bayes’ rule is used to compute the mean of the reverse distribution
  • The resulting loss term is simplified
  • A patch embedding tokenization scheme is used for audio based on mel spectrograms
  • Context size is reduced to channels × time frames
  • The U-Net and neural vocoder are designed to incorporate these considerations
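
As a minimal sketch of sampling the forward process at an arbitrary timestep, the code below applies the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, its endpoints, the 1000-step horizon, and the array shape are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed noise schedule; the paper's schedule may differ."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 without simulating the intermediate steps."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal-retention factor

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 128, 256))      # (channels, mel bins, time frames)
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # almost pure noise
```

During training, the returned `noise` would typically serve as the regression target for the denoising network.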

U-net

  • U-Nets are popular for image segmentation tasks
  • U-Nets have been applied to diffusion modeling
  • Proposed U-Net combines strengths of U-Nets and transformers
  • Model captures local and global context
  • Input layer receives 2D spectrogram
  • Output layer performs inverse actions of input layer
  • Residual block uses 2D latent feature and timestep embedding feature
  • Pre-normalization layer used
  • Linear attention (Katharopoulos et al., 2020) used
  • Post-attention mechanism used
  • Downsampling and upsampling used
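
The linear attention mentioned above can be sketched as follows. This is the non-causal form with the elu(x) + 1 feature map from Katharopoulos et al. (2020); how the paper's U-Net arranges heads, projections, and normalization around it is not shown, and the shapes are illustrative assumptions.

```python
import numpy as np

def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1, which keeps every feature positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Non-causal linear attention for q, k of shape (n, d) and v of shape (n, d_v)."""
    q_f, k_f = elu_plus_one(q), elu_plus_one(k)
    kv = k_f.T @ v              # (d, d_v): keys and values summarized once
    k_sum = k_f.sum(axis=0)     # (d,): shared normalizer term
    return (q_f @ kv) / (q_f @ k_sum)[:, None]

rng = np.random.default_rng(0)
n_tokens, d_model, d_value = 512, 16, 32
q = rng.standard_normal((n_tokens, d_model))
k = rng.standard_normal((n_tokens, d_model))
v = rng.standard_normal((n_tokens, d_value))
out = linear_attention(q, k, v)    # shape (512, 32)
```

Because the key-value summary is computed once, the cost grows linearly with the number of tokens rather than quadratically, which matters for spectrograms with thousands of time frames.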

Neural vocoder

  • Neural vocoder design inspired by ISTFTNet (Kaneko et al., 2022)
  • Takes a mel spectrogram as input and passes it through an input layer, a single residual block, and an output layer
  • Output is STFT spectrogram
  • Exponential activation applied to transform magnitude spectrogram from log-space to linear-space

Experiments

Dataset

  • The POP909 dataset is used
  • Dataset consists of 909 MIDI files of pop songs
  • Audio synthesized from MIDI files using FluidSynth

Data preprocessing

  • Mel spectrograms are computed with an STFT window size of 2048, a hop size of 1024, and 128 mel filterbanks
  • Data-specific preprocessing techniques are used, such as moving-average parameters for standard scaling and min-max scaling
  • U-Net: width 256, 14 U-Net blocks, 49.8 million parameters
  • U-Net optimizer: Adam with β₁ set to 0.5 and learning rate set to 0.0002
  • U-Net training audio limited to 8,387,584 samples (190 seconds)
  • Neural vocoder: width 256, 1.4 million parameters
  • Neural vocoder optimizer: Adam with β₁ set to 0.5 and learning rate set to 0.0002
  • Neural vocoder training audio limited to 523,264 samples (roughly 12 seconds at 44.1 kHz)
  • Both models trained using 16-bit floating point precision
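
As a quick arithmetic check on the clip lengths above, the sketch below converts the sample counts into durations and hop counts using the stated 44.1 kHz sample rate and hop size of 1024. The exact number of spectrogram frames depends on the STFT padding convention, which is not specified here, so only whole hops are counted.

```python
SAMPLE_RATE_HZ = 44_100
HOP_SIZE = 1_024

def clip_stats(num_samples):
    """Return (duration in seconds, number of whole hops) for a clip."""
    return num_samples / SAMPLE_RATE_HZ, num_samples // HOP_SIZE

unet_seconds, unet_hops = clip_stats(8_387_584)
vocoder_seconds, vocoder_hops = clip_stats(523_264)

print(f"U-Net clip:   {unet_seconds:.1f} s, {unet_hops} hops")
print(f"Vocoder clip: {vocoder_seconds:.1f} s, {vocoder_hops} hops")
```

Both sample counts are exact multiples of the hop size (8191 and 511 hops), and the durations come out near 190.2 and 11.9 seconds. With centered framing, an STFT would typically yield one more frame than hops, giving power-of-two lengths that downsample cleanly, though that framing is an assumption.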

Evaluation

  • Model is under active development
  • Manual evaluation of samples has been performed
  • Evaluation focused on long-term coherence and harmony
  • Samples generated with different seeds
  • No quantitative metrics implemented yet

Results

Sampling

  • Human listeners used to evaluate quality of generated samples
  • Generated samples have good long-term coherence and diverse structures
  • Quality of generated samples is lower than human-generated music
  • Model is able to learn generalizable patterns from small dataset
  • Model struggles with global coherence early on in training
  • Longer training leads to better global coherence and sample quality
  • Increasing number of DDIM sampling steps does not improve quality
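
For context on the DDIM step count, here is a minimal sketch of a deterministic DDIM update (eta = 0) and of choosing a shorter timestep sequence. The evenly spaced timestep selection and the function names are assumptions; the noise prediction `eps_pred` would come from the trained U-Net, which is not included.

```python
import numpy as np

def ddim_timesteps(num_train_steps, num_sampling_steps):
    """Evenly spaced subset of training timesteps, from noisiest to cleanest."""
    return np.linspace(num_train_steps - 1, 0, num_sampling_steps).round().astype(int)

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from timestep t to an earlier timestep."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the earlier timestep's noise level.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

steps = ddim_timesteps(num_train_steps=1000, num_sampling_steps=20)
```

Using more sampling steps means more calls to `ddim_step` with smaller jumps between noise levels; the finding above is that this extra cost did not translate into better samples.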

Audio-to-audio (style transfer)

  • Add noise to mel spectrogram to generate variations of original audio
  • Low noise levels maintain structure of original audio, but audio is too noisy
  • High noise levels more closely resemble training data, but structure of original audio is lost
  • Percussive sounds are less sensitive to the noise level, and their structure is preserved in the generated audio

Interpolation

  • Start with two mel spectrograms and add noise at desired time step
  • Interpolate the two noised spectrograms using a ratio between 0 and 1
  • Reverse diffusion process used to generate interpolations of audio sources
  • Percussive sounds tend to be more prominent in generated audio
  • Low noise levels preserve musical structure of original audio, high noise levels resemble training data

Inpainting

  • Binary mask used to identify sections of audio to keep/remove
  • Repaint algorithm used to fill in masked sections
  • Results show inpainted sections lack rhythm and don’t accurately capture melody/harmony of original audio
  • Inpainted audio sounds like a completely novel sample and does not resemble original audio
  • Sudden change in musical structure of song in inpainted sections
  • Further experimentation needed to investigate/address this problem
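
The masking idea above can be sketched as follows. Only the merge step at a single reverse-diffusion timestep is shown, in which known regions come from a noised copy of the original and masked regions come from the model's sample; the resampling schedule of the full RePaint algorithm is omitted. The helper names and the choice to mask along the time axis are assumptions.

```python
import numpy as np

def time_mask(num_frames, keep_until, resume_at):
    """1 where the original audio is kept, 0 where it should be regenerated."""
    mask = np.ones(num_frames)
    mask[keep_until:resume_at] = 0.0
    return mask

def merge_known_and_generated(known_noised, generated, mask):
    """Keep known regions from the noised original and take the rest from the model."""
    return mask * known_noised + (1.0 - mask) * generated

rng = np.random.default_rng(0)
known_noised = rng.standard_normal((2, 128, 64))   # original, noised to timestep t
generated = rng.standard_normal((2, 128, 64))      # model sample at timestep t
mask = time_mask(num_frames=64, keep_until=16, resume_at=48)   # regenerate frames 16..47
merged = merge_known_and_generated(known_noised, generated, mask)
```

The one-dimensional mask broadcasts over the channel and mel-bin axes, so whole time spans are either kept or regenerated.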

Outpainting

  • Outpainting involves extending audio beyond the original recording
  • Algorithm used for outpainting is the same as for inpainting
  • Results of outpainting are sub-optimal: the audio sounds different from the original and lacks rhythm

Future work

  • Improve the accuracy of the model
  • Investigate ways to make the model more useful in a music production setting

Conditional generation

  • Conditioning model on lyrics, mood, or MIDI input to control content or style of generated music.
  • Allowing users to provide feedback during synthesis to shape output in real-time.

Evaluation in realistic settings

  • Conduct user studies to assess usefulness and effectiveness of model in music production setting
  • Compare model output to human musicians or existing music production tools using metrics such as subjective quality or task-specific performance

Generalization to other audio tasks

  • Model could be applied to other audio processing tasks
  • Classification tasks, such as genre classification or instrument recognition
  • Audio restoration tasks, such as noise reduction or audio enhancement

Improvements to model components

  • Improve performance of inpainting and outpainting for audio design.
  • Scale model to improve performance.
  • Incorporate new advancements in diffusion model sampling for near real-time synthesis.

Expanding the range of generated music

  • Generating music from a single instrument.
  • Examining how the model handles more complex data with multiple instruments and vocals.
  • Exploring the use of multi-track or stem generation.
  • Investigating the use of the model for generating music in different styles or genres.

Conclusion

  • Introduced Msanii, a novel diffusion-based model for synthesizing long-context, high-fidelity music efficiently
  • Combines mel spectrograms, diffusion models, and neural vocoders
  • Generates high-quality audio
  • Generates minutes of coherent audio efficiently
  • Potential for use in various applications in the field of music production
  • Standard scaling standardizes the elements of input x using their mean and variance
  • Momentum and momentum decay are used to update the running statistics during training
  • The U-Net architecture can achieve performance gains by increasing its width
  • Width must be at least 2× the frequency dimension of the spectrogram
  • Significant potential for further research and development
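
The standard-scaling notes in the list above (running mean and variance updated with momentum) can be sketched as follows. The update rule, the default values, and in particular the treatment of `momentum_decay` as a multiplicative factor applied after each update are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

class RunningStandardScaler:
    """Standard scaling with exponentially averaged statistics (illustrative sketch)."""

    def __init__(self, momentum=0.1, momentum_decay=1.0, eps=1e-5):
        self.momentum = momentum              # weight given to the newest batch
        self.momentum_decay = momentum_decay  # hypothetical: factor applied to momentum per update
        self.eps = eps
        self.mean = 0.0
        self.var = 1.0

    def update(self, x):
        """Blend the batch mean and variance into the running statistics."""
        m = self.momentum
        self.mean = (1.0 - m) * self.mean + m * float(np.mean(x))
        self.var = (1.0 - m) * self.var + m * float(np.var(x))
        self.momentum = m * self.momentum_decay

    def transform(self, x):
        """Standardize x with the running statistics."""
        return (x - self.mean) / np.sqrt(self.var + self.eps)

    def inverse_transform(self, z):
        """Map standardized values back to the original scale."""
        return z * np.sqrt(self.var + self.eps) + self.mean

rng = np.random.default_rng(0)
scaler = RunningStandardScaler(momentum=0.1, momentum_decay=0.99)
for _ in range(200):
    scaler.update(rng.normal(loc=3.0, scale=2.0, size=10_000))
scaled = scaler.transform(rng.normal(loc=3.0, scale=2.0, size=10_000))
```

Decaying the momentum makes the statistics settle as training progresses, so the scaling applied to spectrograms stops drifting.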