Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Combines mel spectrograms, diffusion models, and neural vocoders to synthesize long-context, high-fidelity music
- Synthesizes 190 seconds of stereo music at 44.1 kHz without concatenative synthesis, cascading architectures, or compression techniques
- First work to successfully employ a diffusion-based model for synthesizing long music samples at high sample rates
- Demo and code available online
Paper Content
Introduction
- Music is a universal language that connects people from diverse cultures.
- Researchers have been investigating if computers can capture the creative process behind music creation.
- Generative modeling has seen significant growth with various techniques.
- These techniques have achieved human-level performance in tasks such as image generation, speech generation, and text generation.
- Music synthesis is a challenging task due to the high dimensionality of audio signals.
- Time-Frequency (TF) representations, such as mel spectrograms, provide a powerful and intuitive way to represent features in audio signals.
- Autoregressive models and GANs are popular choices for music synthesis, but they each have their own challenges.
- Diffusion-based models offer fast inference, a simple training procedure, and have recently outperformed GANs in terms of quality.
- This paper proposes a novel approach for music synthesis using mel spectrograms.
- The proposed method can synthesize minutes of high-fidelity music at a high sample rate.
- The proposed method can also be used to solve other audio tasks, such as audio inpainting and style transfer.
Background
- Audio signals have high dimensionality, making them difficult to represent
- Audio samples need to be discretized into a large number of samples, requiring a high sample rate
- For music at CD quality, the sample rate is typically 44.1 kHz
- To address the challenges of synthesizing long audio samples, a lower-dimensional representation is used to capture important musical features while minimizing computational complexity
- An expressive but efficient generative model is also necessary
Mel spectrograms
- Mel Spectrograms are a popular representation of audio used in tasks such as speech synthesis, voice conversion, and music synthesis.
- Neural vocoders have been developed to approximate both the magnitude and phase from the mel spectrogram.
- An alternative is to reconstruct only the magnitude spectrogram and approximate the phase using traditional methods.
- Two losses are used to reconstruct the magnitude spectrogram: Spectral Convergence Loss and Log-Magnitude Loss.
Diffusion
- Diffusion models are Markovian Hierarchical Variational Autoencoders
- Diffusion models add random noise to data and learn to synthesize data samples from noise
- Reparameterization trick allows sampling from forward distribution at any arbitrary timestep
- Bayes’ rule used to compute mean of distribution
- Loss term simplified
- Patch embedding tokenization scheme used for audio based on mel spectrograms
- Context size reduced to channels x time frames
- U-Net and Neural Vocoder designed to incorporate considerations
U-net
- U-Nets are popular for image segmentation tasks
- U-Nets have been applied to diffusion modeling
- Proposed U-Net combines strengths of U-Nets and transformers
- Model captures local and global context
- Input layer receives 2D spectrogram
- Output layer performs inverse actions of input layer
- Residual block uses 2D latent feature and timestep embedding feature
- Pre-normalization layer used
- Linear Attention Katharopoulos et al. (2020) used
- Post-attention mechanism used
- Downsampling and upsampling used
Neural vocoder
- Neural vocoder design inspired by ISTFTNet Kaneko et al. (2022)
- Takes in input mel spectrogram and passes it through input layer, single residual block, and output layer
- Output is STFT spectrogram
- Exponential activation applied to transform magnitude spectrogram from log-space to linear-space
Experiments
Dataset
- Dataset used is POP909 dataset
- Dataset consists of 909 MIDI files of popular pop songs
- Audio synthesized from MIDI files using FluidSynth
Data preprocessing
- STFT window size of 2048 with a hop size of 1024, 128 mel filterbanks used to synthesize mel spectrograms
- Data-specific preprocessing techniques used, such as moving average parameters for standard scaling and min-max scaling
- U-Net model used with 256 width and 14 U-Net blocks, 49.8 million parameters
- Adam optimizer used with β 1 set to 0.5 and learning rate set to 0.0002
- Audio limited to 8,387,584 samples (190 seconds)
- Neural Vocoder model used with 256 width and 1.4 million parameters
- Adam optimizer used with β 1 set to 0.5 and learning rate set to 0.0002
- Audio limited to 523,264 samples (11 seconds)
- Both models trained using 16-bit floating point precision
Evaluation
- Model is under active development
- Manual evaluation of samples has been performed
- Evaluation focused on long-term coherence and harmony
- Samples generated with different seeds
- No quantitative metrics implemented yet
Results
Sampling
- Human listeners used to evaluate quality of generated samples
- Generated samples have good long-term coherence and diverse structures
- Quality of generated samples is lower than human-generated music
- Model is able to learn generalizable patterns from small dataset
- Model struggles with global coherence early on in training
- Longer training leads to better global coherence and sample quality
- Increasing number of DDIM sampling steps does not improve quality
Audio-to-audio (style transfer)
- Add noise to mel spectrogram to generate variations of original audio
- Low noise levels maintain structure of original audio, but audio is too noisy
- High noise levels more closely resemble training data, but structure of original audio is lost
- Percussive sounds less sensitive to noise levels, structure of generated audio is preserved
Interpolation
- Start with two mel spectrograms and add noise at desired time step
- Interpolate the two noised spectrograms using a ratio of 0-1
- Reverse diffusion process used to generate interpolations of audio sources
- Percussive sounds tend to be more prominent in generated audio
- Low noise levels preserve musical structure of original audio, high noise levels resemble training data
Inpainting
- Binary mask used to identify sections of audio to keep/remove
- Repaint algorithm used to fill in masked sections
- Results show inpainted sections lack rhythm and don’t accurately capture melody/harmony of original audio
- Inpainted audio sounds like a completely novel sample and does not resemble original audio
- Sudden change in musical structure of song in inpainted sections
- Further experimentation needed to investigate/address this problem
Outpainting
- Outpainting involves extending audio beyond the original recording
- Algorithm used for outpainting is the same as for inpainting
- Results of outpainting are sub-optimal, audio sounds different from original and lacks rhythm
Future work
- Improve the accuracy of the model
- Investigate ways to make the model more useful in a music production setting
Conditional generation
- Conditioning model on lyrics, mood, or MIDI input to control content or style of generated music.
- Allowing users to provide feedback during synthesis to shape output in real-time.
Evaluation in realistic settings
- Conduct user studies to assess usefulness and effectiveness of model in music production setting
- Compare model output to human musicians or existing music production tools using metrics such as subjective quality or task-specific performance
Generalization to other audio tasks
- Model could be applied to other audio processing tasks
- Classification tasks, such as genre classification or instrument recognition
- Audio restoration tasks, such as noise reduction or audio enhancement
Improvements to model components
- Improve performance of inpainting and outpainting for audio design.
- Scale model to improve performance.
- Incorporate new advancements in diffusion model sampling for near real-time synthesis.
Expanding the range of generated music
- Generating music from a single instrument.
- Examining how the model handles more complex data with multiple instruments and vocals.
- Exploring the use of multi-track or stem generation.
- Investigating the use of the model for generating music in different styles or genres.
Conclusion
- Introduced Msanii, a novel diffusion-based model for synthesizing long-context, high-fidelity music efficiently
- Combines mel spectrograms, diffusion models, and neural vocoders
- Generates high quality audio
- Generates minutes of coherent audio efficiently
- Potential for use in various applications in the field of music production
- Standardizes elements of input x using mean and variance
- Momentum and momentum decay used to update running statistics during training
- U-Net architecture can achieve performance gains by increasing width
- Width must be at least 2x larger than frequency dimension of spectrogram
- Significant potential for further research and development