Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents SingSong, a system that generates instrumental music to accompany input vocals
- Builds on recent developments in musical source separation and audio generation
- Applies a state-of-the-art source separation algorithm to a large corpus of music audio
- Adapts AudioLM for conditional “audio-to-audio” generation tasks
- Listeners expressed a preference for instrumentals generated by SingSong compared to a retrieval baseline
Paper Content
Introduction
Related work
- Microsoft Songsmith extracts pitch information from input vocals and predicts a sequence of symbolic chord labels
- Lattner & Grachten (2019) and Grachten et al. (2020) generate symbolic kick drum and bass accompaniments
- Wu et al. (2022) generate audio of drum tracks given music audio without drums
- Task of generating entire accompaniment track based solely on vocal performance
- Body of work around symbolic harmonization, predicting chord labels for symbolic melody input
- Agostinelli et al. (2023) propose MusicLM to generate music audio conditioned on input text descriptions
Task definition and methods
- Task of vocal accompaniment is a conditional generative modeling problem
- Goal is to model a distribution of appropriate instrumental waveforms for vocal waveforms
- Both waveforms are monaural, T seconds in length, and sampled at some rate f s
- Outputs are generated by sampling and linearly mixing the vocals and waveforms
Modeling proxy distributions of audio codes
- Audio requires high sampling rates to be represented
- Modeling audio distributions is challenging
- Model discrete audio codes instead of waveforms
- Use a discrete codec with two functions: Enc and Dec
- Model a proxy distribution over codes produced by Enc
- Approximate the distribution over waveforms by leveraging Dec
- Sampling audio from this distribution involves sampling a proxy and outputting Dec
Audiolm preliminaries
- AudioLM is a state-of-the-art unconditional generative model of audio
- AudioLM uses a factorized approach to model a proxy distribution over multiple types of audio codes
- SoundStream is a pretrained model with an encoder and decoder
- AudioLM “flattens” the coarse and fine codes, inducing rates of 200 and 400 codes per second of audio
- Jukebox is a past music audio generative model that directly modeled representations similar to acoustic codes
- AudioLM proposes to use smaller models to model a joint distribution over acoustic codes and low-rate semantic codes
- AudioLM factorizes the joint distribution over semantic and acoustic codes
- AudioLM is a cascade of three models that generate increasingly high-rate codes
- AudioLM is adapted to model proxy distributions over instrumental codes given vocal codes
- A three-stage factorization is used to generate audio with AudioLM
- T5 is used to predict target codes given input codes
Featurizing input vocals
- We use a function called Feats to extract discrete features from vocal inputs.
- We add white noise to vocal inputs during training and inference.
- The purpose of adding noise is to conceal artifacts of the original instrumental that remain in source-separated vocals.
- We explore a range of featurizations including additive noise and different combinations of semantic and acoustic codes.
Experiments and results
- Training SingSong
- Evaluating SingSong
Datasets
- Training set for SingSong is 1 million audio-only sources, 46k hours of music
- Preprocessed by resampling, averaging stereo to mono
- Preprocessing differs for pre-training SoundStream and w2v-BERT vs. training SingSong
- Extract non-overlapping 10s clips from each mix and input to MDXNet for source separation
- MUSDB18 dataset used for evaluation, 1232 clips for training set, 778 clips for test set
Filtering training data
- Filter out clips where instrumental is silent or vocals are louder than instrumental
- Filter out clips where peak RMS amplitude of instrumental is below -25dB or vocals are at least 5dB louder than instrumental
- Goal of filtering is to bias system towards always outputting some instrumental for all inputs
Evaluation
- Adopted Fréchet Audio Distance (FAD) metric to evaluate audio quality
- FAD is the audio analogue to FID metric used for image generation models
- Compute FAD on MUSDB18 using ground truth mixes as reference audio
- Compute FAD on isolated and source-separated vocals
- Compute negative log-likelihood (NLL) of models over coarse acoustic codes from isolated instrumentals
Modeling hyperparameters
- Adopted default architecture and training hyperparameters from t5.1.1.base configuration
- Increased dropout from 0 to 0.1
- Replaced relative positional embeddings with fixed positional encodings from vanilla Transformer
- Used combination of Noisy and S-SA featurization to improve generalization
- Trained models for 200k steps on 10s clips
Audio featurization experiments
- We explore different audio featurization configurations for input vocals and target instrumentals
- We use Feats function to concatenate semantic and coarse acoustic codes for vocals with optional additive noise
- We experiment with two noise conditions: Clean and Noisy
- As a default for the target, we concatenate semantic and coarse acoustic codes for the instrumental with no noise
- We show that different input vocal featurizations produce different generalization properties
- We experiment with additional featurizations for the input and target
- Our best model adds noise to vocal inputs, uses only semantic codes for vocals as conditioning info, and uses both semantic and coarse acoustic codes for target instrumentals
- Scaling up improves quantitative performance compared to smaller scale
Listening study
- Conducted a listening study to measure performance of two models
- Listeners presented with two 10s vocal-instrumental mixtures
- Vocal identical between mixtures, from MUSDB18-test
- Instrumentals from different sources (ground truth, models, baselines)
- Listeners asked to indicate which mixture has more musically compatible instrumental accompaniments
- Listeners discouraged from paying attention to audio fidelity of instrumental
Baselines
- Random baseline retrieves instrumental clip uniformly at random from MUSDB18-dev
- Retrieval baseline uses musical features of ground truth mixture to retrieve instrumental from MUSDB18-dev and adapt it
- Retrieval baseline similar to Songsmith but retrieves instrumental audio instead of MIDI
- Key and tempo features computed using madmom library
- For input vocal query from MUSDB18-test, key probabilities estimated and tempo detected from ground truth instrumental
- Instrumental track selected from MUSDB18-dev with lowest Euclidean distance in key probability space
- Instrumental time stretched to match estimated tempo of input
Results
- Listeners preferred instrumentals from SingSong-XL 66% of the time compared to the strongest baseline.
- Wilcoxon signed-rank test showed that instrumentals from SingSong-Base and XL were preferred significantly more often than the strongest baseline.
- Listeners preferred instrumentals from SingSong-XL 56% of the time compared to SingSong-Base.
- Listeners preferred instrumentals from SingSong-XL 57% of the time compared to any other source including the ground truth.
Discussion
- SingSong produces instrumentals with strong qualitative performance relative to a strong baseline.
- Instrumentals have clear harmonic and temporal correspondence to the input vocals.
- Instrumentals have weaker harmonic elements compared to percussive elements.
- Results on Vocadito dataset are promising.
A. additional experimental results
- FAD and NLL are compared for two experiments using isolated and source-separated vocals as input
- NLL on isolated vocals decreases monotonically, while FAD on isolated vocals diverges in one experiment
- Subjective opinions agree with FAD more often than with NLL
- Quantitative evaluation and analysis is centered around FAD
- NLL is reported on isolated and source-separated vocal inputs
- Two additional experiments are reported under the Noisy/SA-SA condition
- Removing filtering and using relative positional embeddings results in worse FAD
B. additional listening study details
- Participants in the listener study were asked to rate the best fit between vocals and instrumental tracks.
- Raters were asked to focus on the best fit between the vocals and instrumental tracks, not audio quality.
B.1. additional retrieval baseline details
- Kept ratio of estimated tempi of vocals and instrumentals between 0.5 and 2.0
- Removed 4 tracks from MUSDB18-dev retrieval set that were shorter than 20 seconds
- Adapted AudioLM to be suitable for training conditional “audio-to-audio” generative models of instrumentals given vocals
- Added white noise to input to conceal residual artifacts of instrumental present in source-separated vocals
- Extracted semantic codes from pre-trained w2v-BERT model and coarse acoustic codes from pre-trained SoundStream codec
- Used T5 to predict target codes given input codes
- Experimented with different featurizations of input vocals to improve system’s ability to generalize
- Used FAD and NLL to evaluate performance