Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Msanii combines mel spectrograms, diffusion models, and neural vocoders to synthesize long-context, high-fidelity music
Synthesizes 190 seconds of stereo music at 44.1 kHz without concatenative synthesis, cascading architectures, or compression techniques
To the authors' knowledge, the first work to successfully employ a diffusion-based model for synthesizing long music samples at high sample rates
Demo and code available online

Paper Content

Introduction

Music is a universal language that connects people from diverse cultures.
Researchers have been investigating if computers can capture the creative process behind music creation.
Generative modeling has seen significant growth with various techniques.
These techniques have achieved human-level performance in tasks such as image generation, speech generation, and text generation.
Music synthesis is a challenging task due to the high dimensionality of audio signals.
Time-Frequency (TF) representations, such as mel spectrograms, provide a powerful and intuitive way to represent features in audio signals.
Autoregressive models and GANs are popular choices for music synthesis, but they each have their own challenges.
Diffusion-based models offer fast inference, a simple training procedure, and have recently outperformed GANs in terms of quality.
This paper proposes a novel approach for music synthesis using mel spectrograms.
The proposed method can synthesize minutes of high-fidelity music at a high sample rate.
The proposed method can also be used to solve other audio tasks, such as audio inpainting and style transfer.

Background

Audio signals have high dimensionality, making them difficult to represent
Audio signals must be discretized into a large number of samples, and high fidelity requires a high sample rate
For music at CD quality, the sample rate is typically 44.1 kHz
To address the challenges of synthesizing long audio samples, a lower-dimensional representation is used to capture important musical features while minimizing computational complexity
An expressive but efficient generative model is also necessary
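To make the dimensionality argument concrete, the short calculation below compares the raw waveform with a mel-spectrogram view of the same audio. It is only a back-of-the-envelope sketch: the sample count, hop size, and number of mel bands are the figures quoted under Data preprocessing, and the assumption of exactly one frame per hop (no edge padding) is mine.

```python
SAMPLE_RATE = 44_100      # Hz, CD quality
NUM_SAMPLES = 8_387_584   # per-channel sample limit quoted under Data preprocessing
CHANNELS = 2              # stereo
HOP_SIZE = 1_024          # STFT hop size quoted under Data preprocessing
N_MELS = 128              # mel filterbanks quoted under Data preprocessing

duration_s = NUM_SAMPLES / SAMPLE_RATE
raw_values = NUM_SAMPLES * CHANNELS

# Assumption: one time frame per hop, ignoring any padding at the edges.
time_frames = NUM_SAMPLES // HOP_SIZE
mel_values = time_frames * N_MELS * CHANNELS

print(f"duration:            {duration_s:.1f} s")
print(f"raw waveform values: {raw_values:,}")
print(f"time frames:         {time_frames:,}")   # 8,191
print(f"mel values:          {mel_values:,}")
print(f"sequence shortening: {NUM_SAMPLES // time_frames}x")
```

Under these assumptions the sequence the model has to attend over shrinks from millions of samples to a few thousand time frames, which is what makes minutes-long context tractable.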

Mel spectrograms

Mel spectrograms are a popular representation of audio used in tasks such as speech synthesis, voice conversion, and music synthesis.
Neural vocoders have been developed to approximate both the magnitude and phase from the mel spectrogram.
An alternative is to reconstruct only the magnitude spectrogram and approximate the phase using traditional methods.
Two losses are used to reconstruct the magnitude spectrogram: Spectral Convergence Loss and Log-Magnitude Loss.
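The two losses named above can be sketched in NumPy as follows. This uses the commonly cited definitions (spectral convergence as a relative Frobenius-norm error, log-magnitude loss as a mean absolute error in log space); the function names, the `eps` constant, and the toy spectrogram shape are assumptions for illustration, not the paper's code.

```python
import numpy as np

def spectral_convergence_loss(mag_true, mag_pred):
    """Frobenius norm of the error, relative to the norm of the target."""
    return np.linalg.norm(mag_true - mag_pred) / np.linalg.norm(mag_true)

def log_magnitude_loss(mag_true, mag_pred, eps=1e-5):
    """Mean absolute difference between log-magnitudes.

    eps is an assumed small constant that guards against log(0).
    """
    return np.mean(np.abs(np.log(mag_true + eps) - np.log(mag_pred + eps)))

rng = np.random.default_rng(0)
# Toy magnitude spectrogram: (frequency bins, time frames); 1025 = 2048 // 2 + 1.
target = rng.random((1025, 64))
estimate = 1.1 * target  # a uniformly 10% too loud estimate

print(spectral_convergence_loss(target, estimate))
print(log_magnitude_loss(target, estimate))
```

Spectral convergence emphasizes large-magnitude components, while the log-magnitude term gives more weight to quiet ones, which is why the two are typically used together.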

Diffusion

Diffusion models are Markovian Hierarchical Variational Autoencoders
Diffusion models add random noise to data and learn to synthesize data samples from noise
Reparameterization trick allows sampling from forward distribution at any arbitrary timestep
Bayes’ rule is used to compute the mean of the reverse-process posterior distribution
Loss term simplified
Patch embedding tokenization scheme used for audio based on mel spectrograms
Context size reduced to channels x time frames
The U-Net and neural vocoder are designed with these considerations in mind
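The reparameterization point above means a noised sample at any timestep can be drawn in one shot from the clean input. The sketch below illustrates this with the standard closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, step count, and function names are assumptions chosen for illustration; the summary does not state which schedule the paper uses.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 without simulating the intermediate steps."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((128, 256))  # toy input: (mel bands, time frames)
x_t, noise = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
print(x_t.shape, float(alpha_bar[500]))
```

Because the noise used to build `x_t` is known, a network can be trained to predict it (or the clean input) from `x_t` and `t`.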

U-net

U-Nets are popular for image segmentation tasks
U-Nets have been applied to diffusion modeling
Proposed U-Net combines strengths of U-Nets and transformers
Model captures local and global context
Input layer receives 2D spectrogram
Output layer performs inverse actions of input layer
Residual block uses 2D latent feature and timestep embedding feature
Pre-normalization layer used
Linear attention (Katharopoulos et al., 2020) used
Post-attention mechanism used
Downsampling and upsampling used
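For readers unfamiliar with linear attention, here is a minimal single-head, non-causal NumPy sketch of the formulation in Katharopoulos et al. (2020), using the feature map φ(x) = elu(x) + 1. It shows why the cost grows linearly with sequence length: the key-value summary is computed once and reused for every query. How the paper wires this into its U-Net blocks is not covered here, and the function names are illustrative.

```python
import numpy as np

def elu_feature_map(x):
    """phi(x) = elu(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """q, k: (n, d); v: (n, d_v). Returns an (n, d_v) array."""
    q, k = elu_feature_map(q), elu_feature_map(k)
    kv = k.T @ v                  # (d, d_v) summary, independent of n queries
    z = q @ k.sum(axis=0)         # (n,) per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
n, d = 8191, 32                   # n matches the time-frame count for 190 s of audio
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)
```

The n × n attention matrix is never formed, so memory stays proportional to n · d rather than n².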

Neural vocoder

Neural vocoder design inspired by iSTFTNet (Kaneko et al., 2022)
Takes in input mel spectrogram and passes it through input layer, single residual block, and output layer
Output is STFT spectrogram
Exponential activation applied to transform magnitude spectrogram from log-space to linear-space

Experiments

Dataset

Dataset used is POP909 dataset
Dataset consists of MIDI files for 909 popular songs
Audio synthesized from MIDI files using FluidSynth

Data preprocessing

STFT window size of 2048 with a hop size of 1024, 128 mel filterbanks used to synthesize mel spectrograms
Data-specific preprocessing techniques used, such as moving average parameters for standard scaling and min-max scaling
U-Net model used with 256 width and 14 U-Net blocks, 49.8 million parameters
Adam optimizer used with β₁ set to 0.5 and learning rate set to 0.0002
Audio limited to 8,387,584 samples (190 seconds)
Neural Vocoder model used with 256 width and 1.4 million parameters
Adam optimizer used with β₁ set to 0.5 and learning rate set to 0.0002
Audio limited to 523,264 samples (about 11.9 seconds at 44.1 kHz)
Both models trained using 16-bit floating point precision
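The moving-average standard scaling mentioned in the preprocessing notes above could look roughly like the sketch below: running estimates of the mean and variance are updated from each training batch and then used to standardize inputs. This is an assumption-laden illustration, not the paper's implementation. The class name, the exponential-moving-average update rule, and the default `momentum` are hypothetical, and the momentum decay the paper refers to is left out because its exact form is not given here.

```python
import numpy as np

class RunningStandardScaler:
    """Hypothetical scaler that tracks running statistics across batches."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum  # assumed EMA weight for new batches
        self.eps = eps            # assumed constant for numerical stability
        self.mean = 0.0
        self.var = 1.0

    def update(self, x):
        """Blend the batch statistics into the running estimates."""
        m = self.momentum
        self.mean = (1.0 - m) * self.mean + m * float(x.mean())
        self.var = (1.0 - m) * self.var + m * float(x.var())

    def transform(self, x):
        return (x - self.mean) / np.sqrt(self.var + self.eps)

    def inverse_transform(self, z):
        return z * np.sqrt(self.var + self.eps) + self.mean

rng = np.random.default_rng(0)
scaler = RunningStandardScaler()
for _ in range(200):
    batch = 3.0 + 2.0 * rng.standard_normal((128, 64))  # toy log-mel batch
    scaler.update(batch)
print(scaler.mean, scaler.var)
```

Keeping an `inverse_transform` matters in this setting because generated spectrograms have to be mapped back to the original scale before they reach the vocoder.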

Evaluation

Model is under active development
Manual evaluation of samples has been performed
Evaluation focused on long-term coherence and harmony
Samples generated with different seeds
No quantitative metrics implemented yet

Results

Sampling

Human listeners used to evaluate quality of generated samples
Generated samples have good long-term coherence and diverse structures
Quality of generated samples is lower than human-generated music
Model is able to learn generalizable patterns from small dataset
Model struggles with global coherence early on in training
Longer training leads to better global coherence and sample quality
Increasing number of DDIM sampling steps does not improve quality

Audio-to-audio (style transfer)

Add noise to mel spectrogram to generate variations of original audio
Low noise levels maintain structure of original audio, but audio is too noisy
High noise levels more closely resemble training data, but structure of original audio is lost
Percussive sounds less sensitive to noise levels, structure of generated audio is preserved

Interpolation

Start with two mel spectrograms and add noise at desired time step
Interpolate the two noised spectrograms using a ratio between 0 and 1
Reverse diffusion process used to generate interpolations of audio sources
Percussive sounds tend to be more prominent in generated audio
Low noise levels preserve musical structure of original audio, high noise levels resemble training data
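The first two steps of this procedure (noising both spectrograms to the same timestep, then blending them) can be sketched as below. The blend is written as a plain linear interpolation, which is an assumption since the summary only says a ratio is used; the function names and the example `alpha_bar_t` value are also illustrative. The reverse diffusion pass that turns the blended result into audio requires the trained model and is not shown.

```python
import numpy as np

def noise_to_timestep(x0, alpha_bar_t, rng):
    """Forward-diffuse x0 to the noise level given by alpha_bar_t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

def interpolate_noised(mel_a, mel_b, ratio, alpha_bar_t, rng):
    """Blend two noised spectrograms; ratio=0 keeps A, ratio=1 keeps B."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    noised_a = noise_to_timestep(mel_a, alpha_bar_t, rng)
    noised_b = noise_to_timestep(mel_b, alpha_bar_t, rng)
    return (1.0 - ratio) * noised_a + ratio * noised_b

rng = np.random.default_rng(0)
mel_a = rng.standard_normal((128, 256))  # toy stand-ins for two mel spectrograms
mel_b = rng.standard_normal((128, 256))
mixed = interpolate_noised(mel_a, mel_b, ratio=0.5, alpha_bar_t=0.3, rng=rng)
print(mixed.shape)
```

A smaller `alpha_bar_t` corresponds to a higher noise level, matching the trade-off described above between preserving the sources' structure and drifting toward the training data. One caveat with a plain linear blend is that mixing two independent noise draws lowers the noise variance at intermediate ratios.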

Inpainting

Binary mask used to identify sections of audio to keep/remove
RePaint algorithm used to fill in masked sections
Results show inpainted sections lack rhythm and don’t accurately capture melody/harmony of original audio
Inpainted audio sounds like a completely novel sample and does not resemble original audio
Sudden change in musical structure of song in inpainted sections
Further experimentation needed to investigate/address this problem
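The core of mask-based inpainting can be illustrated with the merge step used in RePaint-style sampling: at each reverse step, the regions to keep are taken from a freshly noised copy of the original, and the masked regions are taken from the model's sample. The sketch below shows only that merge, with a random array standing in for the model output. The function name and toy shapes are assumptions, and RePaint's resampling (jumping back and forth in time) is omitted.

```python
import numpy as np

def repaint_merge(x0_known, x_prev_generated, keep_mask, alpha_bar_prev, rng):
    """Combine known and generated content at one reverse-diffusion step.

    keep_mask is 1 where the original audio is kept and 0 where it is regenerated.
    """
    noise = rng.standard_normal(x0_known.shape)
    # Noise the original to the same level as the generated sample.
    x_prev_known = (np.sqrt(alpha_bar_prev) * x0_known
                    + np.sqrt(1.0 - alpha_bar_prev) * noise)
    return keep_mask * x_prev_known + (1.0 - keep_mask) * x_prev_generated

rng = np.random.default_rng(0)
mel = rng.standard_normal((128, 64))        # toy original: (mel bands, time frames)
generated = rng.standard_normal((128, 64))  # stand-in for a model sample at step t-1
mask = np.ones((128, 64))
mask[:, 24:40] = 0.0                        # regenerate frames 24..39
merged = repaint_merge(mel, generated, mask, alpha_bar_prev=0.9, rng=rng)
print(merged.shape)
```

Since the kept and regenerated regions only meet through this merge, the model has limited opportunity to condition on the surrounding context, which may be one factor behind abrupt structural changes at the boundaries.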

Outpainting

Outpainting involves extending audio beyond the original recording
Algorithm used for outpainting is the same as for inpainting
Results of outpainting are sub-optimal, audio sounds different from original and lacks rhythm

Future work

Improve the accuracy of the model
Investigate ways to make the model more useful in a music production setting

Conditional generation

Conditioning model on lyrics, mood, or MIDI input to control content or style of generated music.
Allowing users to provide feedback during synthesis to shape output in real-time.

Evaluation in realistic settings

Conduct user studies to assess usefulness and effectiveness of model in music production setting
Compare model output to human musicians or existing music production tools using metrics such as subjective quality or task-specific performance

Generalization to other audio tasks

Model could be applied to other audio processing tasks
Classification tasks, such as genre classification or instrument recognition
Audio restoration tasks, such as noise reduction or audio enhancement

Improvements to model components

Improve performance of inpainting and outpainting for audio design.
Scale model to improve performance.
Incorporate new advancements in diffusion model sampling for near real-time synthesis.

Expanding the range of generated music

Generating music from a single instrument.
Examining how the model handles more complex data with multiple instruments and vocals.
Exploring the use of multi-track or stem generation.
Investigating the use of the model for generating music in different styles or genres.

Conclusion

Introduced Msanii, a novel diffusion-based model for synthesizing long-context, high-fidelity music efficiently
Combines mel spectrograms, diffusion models, and neural vocoders
Generates high quality audio
Generates minutes of coherent audio efficiently
Potential for use in various applications in the field of music production
Moving-average standard scaling standardizes the elements of input x using a running mean and variance
Momentum and momentum decay are used to update the running statistics during training
The U-Net architecture can achieve performance gains by increasing width
Width must be at least 2x larger than the frequency dimension of the spectrogram
Significant potential for further research and development
