Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents SingSong, a system that generates instrumental music to accompany input vocals
Builds on recent developments in musical source separation and audio generation
Applies a state-of-the-art source separation algorithm to a large corpus of music audio
Adapts AudioLM for conditional “audio-to-audio” generation tasks
Listeners expressed a preference for instrumentals generated by SingSong compared to a retrieval baseline

Paper Content

Introduction

Microsoft Songsmith extracts pitch information from input vocals and predicts a sequence of symbolic chord labels
Lattner & Grachten (2019) and Grachten et al. (2020) generate symbolic kick drum and bass accompaniments
Wu et al. (2022) generate audio of drum tracks given music audio without drums
Task of generating entire accompaniment track based solely on vocal performance
Body of work around symbolic harmonization, predicting chord labels for symbolic melody input
Agostinelli et al. (2023) propose MusicLM to generate music audio conditioned on input text descriptions

Task definition and methods

Task of vocal accompaniment is a conditional generative modeling problem
Goal is to model a distribution of appropriate instrumental waveforms for vocal waveforms
Both waveforms are monaural, T seconds in length, and sampled at some rate f_s
Outputs are generated by sampling an instrumental waveform from the model and linearly mixing it with the input vocals
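
The final mixing step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `mix`, the gain parameters, and the sample rate are assumptions made for the example, and the constant-valued lists stand in for real waveforms.

```python
def mix(vocals, instrumental, vocal_gain=1.0, instrumental_gain=1.0):
    """Linearly mix two mono waveforms of equal length, sample by sample."""
    if len(vocals) != len(instrumental):
        raise ValueError("waveforms must have the same number of samples")
    return [vocal_gain * v + instrumental_gain * i
            for v, i in zip(vocals, instrumental)]

f_s = 16000               # assumed sample rate in Hz
T = 0.001                 # clip length in seconds (tiny, for illustration)
n = round(T * f_s)        # number of samples per waveform
vocals = [0.1] * n        # placeholder for the input vocal waveform
instrumental = [0.25] * n # placeholder for a sampled instrumental waveform
output = mix(vocals, instrumental)
```

Both inputs must share the same length and sample rate, which matches the task definition above.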

Modeling proxy distributions of audio codes

Audio requires high sampling rates to be represented
Modeling audio distributions is challenging
Model discrete audio codes instead of waveforms
Use a discrete codec with two functions: Enc and Dec
Model a proxy distribution over codes produced by Enc
Approximate the distribution over waveforms by leveraging Dec
Sampling audio from this distribution involves sampling codes from the proxy distribution and decoding them with Dec
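
The Enc/Dec idea can be illustrated with a toy codec. Everything here is an assumption for illustration: a uniform quantizer stands in for the learned neural codec, and an unconditional unigram distribution stands in for the conditional sequence model that the paper actually trains.

```python
import random
from collections import Counter

NUM_CODES = 8  # toy codebook size (assumption)

def enc(waveform):
    """Map each sample in [-1, 1] to an integer code in [0, NUM_CODES)."""
    codes = []
    for x in waveform:
        x = max(-1.0, min(1.0, x))
        codes.append(min(NUM_CODES - 1, int((x + 1.0) / 2.0 * NUM_CODES)))
    return codes

def dec(codes):
    """Map each code back to the centre of its quantization bin."""
    return [(c + 0.5) / NUM_CODES * 2.0 - 1.0 for c in codes]

def fit_proxy(code_sequences):
    """Estimate a toy unigram distribution over codes."""
    counts = Counter(c for seq in code_sequences for c in seq)
    total = sum(counts.values())
    return {c: k / total for c, k in counts.items()}

def sample_audio(proxy, length, rng):
    """Sample codes from the proxy distribution, then decode them to audio."""
    codes = rng.choices(list(proxy), weights=list(proxy.values()), k=length)
    return dec(codes)

training_waveforms = [[-0.9, -0.1, 0.1, 0.9], [0.1, 0.1, 0.9, -0.9]]
proxy = fit_proxy(enc(w) for w in training_waveforms)
audio = sample_audio(proxy, length=5, rng=random.Random(0))
```

The point of the sketch is the order of operations: the model is fit and sampled entirely in code space, and waveforms only reappear after `dec`.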

AudioLM preliminaries

AudioLM is a state-of-the-art unconditional generative model of audio
AudioLM uses a factorized approach to model a proxy distribution over multiple types of audio codes
SoundStream is a pretrained model with an encoder and decoder
AudioLM “flattens” the coarse and fine codes, inducing rates of 200 and 400 codes per second of audio, respectively
Jukebox is a past music audio generative model that directly modeled representations similar to acoustic codes
AudioLM proposes to use smaller models to model a joint distribution over acoustic codes and low-rate semantic codes
AudioLM factorizes the joint distribution over semantic and acoustic codes
AudioLM is a cascade of three models that generate increasingly high-rate codes
AudioLM is adapted to model proxy distributions over instrumental codes given vocal codes
A three-stage factorization is used to generate audio with AudioLM
T5 is used to predict target codes given input codes
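
The flattening step and the resulting code rates can be sketched as below. The frame rate and the number of coarse and fine quantizer levels are assumptions chosen to be consistent with the 200 and 400 codes per second stated above; the summary itself does not list them.

```python
def flatten_codes(frames):
    """Interleave per-frame codes into one sequence, frame by frame."""
    return [code for frame in frames for code in frame]

FRAME_RATE = 50   # codec frames per second (assumption)
NUM_COARSE = 4    # coarse quantizer levels per frame (assumption)
NUM_FINE = 8      # fine quantizer levels per frame (assumption)

coarse_rate = FRAME_RATE * NUM_COARSE  # flattened coarse codes per second
fine_rate = FRAME_RATE * NUM_FINE      # flattened fine codes per second

# Two frames of coarse codes, each holding NUM_COARSE levels.
coarse_frames = [[11, 12, 13, 14], [21, 22, 23, 24]]
flat = flatten_codes(coarse_frames)
```

Flattening turns a two-dimensional grid of codes into a single token sequence, which is what lets a standard sequence model predict it, at the cost of a longer sequence.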

Featurizing input vocals

We use a function called Feats to extract discrete features from vocal inputs.
We add white noise to vocal inputs during training and inference.
The purpose of adding noise is to conceal artifacts of the original instrumental that remain in source-separated vocals.
We explore a range of featurizations including additive noise and different combinations of semantic and acoustic codes.
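
Adding white noise to a vocal input might look like the sketch below. The Gaussian noise model, the `snr_db` parameter, and its value are assumptions for illustration; the summary does not state the noise level the authors used.

```python
import math
import random

def add_white_noise(vocals, snr_db, rng):
    """Add Gaussian white noise to a waveform at a given signal-to-noise ratio."""
    signal_power = sum(x * x for x in vocals) / len(vocals)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in vocals]

rng = random.Random(0)
# One second of a 220 Hz sine at 16 kHz as a stand-in for a vocal recording.
vocals = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(16000)]
noisy = add_white_noise(vocals, snr_db=20.0, rng=rng)
```

Because the same noise is applied at training and inference time, the model sees inputs with similar statistics whether the vocals are source-separated or recorded in isolation.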

Experiments and results

Training SingSong
Evaluating SingSong

Datasets

Training set for SingSong is 1 million audio-only sources, 46k hours of music
Preprocessed by resampling, averaging stereo to mono
Preprocessing differs for pre-training SoundStream and w2v-BERT vs. training SingSong
Extract non-overlapping 10s clips from each mix and pass each clip to MDXNet for source separation
MUSDB18 dataset used for evaluation, 1232 clips for training set, 778 clips for test set

Filtering training data

Filter out clips where instrumental is silent or vocals are louder than instrumental
Filter out clips where peak RMS amplitude of instrumental is below -25dB or vocals are at least 5dB louder than instrumental
Goal of filtering is to bias system towards always outputting some instrumental for all inputs
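
A filter along these lines could be sketched as follows. The thresholds come from the text above; the windowed definition of "peak RMS", the window size, and the choice to compare vocals and instrumental on the same measure are assumptions.

```python
import math

def rms_db(samples):
    """RMS level of a block of samples in dB relative to full scale."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-10))

def peak_rms_db(samples, window=1024):
    """Highest RMS level over non-overlapping windows (window size is an assumption)."""
    return max(rms_db(samples[i:i + window])
               for i in range(0, len(samples), window))

def keep_clip(vocals, instrumental, min_inst_db=-25.0, max_vocal_lead_db=5.0):
    """Return True if the clip passes both filtering rules."""
    inst_db = peak_rms_db(instrumental)
    if inst_db < min_inst_db:
        return False  # instrumental is effectively silent
    if peak_rms_db(vocals) - inst_db >= max_vocal_lead_db:
        return False  # vocals dominate the instrumental
    return True
```

For example, a clip whose instrumental sits at a constant amplitude of 0.01 (about -40 dB) would be rejected by the first rule.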

Evaluation

Adopted Fréchet Audio Distance (FAD) metric to evaluate audio quality
FAD is the audio analogue to FID metric used for image generation models
Compute FAD on MUSDB18 using ground truth mixes as reference audio
Compute FAD on isolated and source-separated vocals
Compute negative log-likelihood (NLL) of models over coarse acoustic codes from isolated instrumentals
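
To give a feel for what a Fréchet distance measures, the sketch below compares two sets of embedding vectors. It makes a simplifying assumption that the actual metric does not: covariances are treated as diagonal, so the covariance term reduces to a per-dimension sum. The real FAD fits full-covariance Gaussians to embeddings from a pretrained audio classifier, and the embeddings here are placeholder numbers.

```python
import math

def mean_and_var(embeddings):
    """Per-dimension mean and variance of a list of embedding vectors."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n
           for d in range(dim)]
    return mean, var

def frechet_distance_diag(ref, gen):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mu_r, var_r = mean_and_var(ref)
    mu_g, var_g = mean_and_var(gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2 * math.sqrt(a * b)
                   for a, b in zip(var_r, var_g))
    return mean_term + cov_term

reference = [[0.0, 0.0], [2.0, 2.0]]   # placeholder reference embeddings
generated = [[1.0, 1.0], [3.0, 3.0]]   # placeholder generated embeddings
distance = frechet_distance_diag(reference, generated)
```

Lower values mean the generated audio's embedding statistics are closer to those of the reference audio; identical sets give a distance of zero.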

Modeling hyperparameters

Adopted default architecture and training hyperparameters from t5.1.1.base configuration
Increased dropout from 0 to 0.1
Replaced relative positional embeddings with fixed positional encodings from vanilla Transformer
Used combination of Noisy and S-SA featurization to improve generalization
Trained models for 200k steps on 10s clips
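
The fixed positional encodings of the vanilla Transformer follow a well-known sinusoidal formula, sketched below. How the authors wired these into the T5 codebase is not described in this summary, and the sizes used in the example are arbitrary.

```python
import math

def sinusoidal_positions(num_positions, d_model):
    """Fixed sinusoidal positional encodings as in the original Transformer."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Even and odd dimensions share a frequency; sin on even, cos on odd.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoidal_positions(num_positions=4, d_model=8)
```

Unlike learned relative embeddings, this table depends only on the position index and the model width, so it needs no training.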

Audio featurization experiments

We explore different audio featurization configurations for input vocals and target instrumentals
We use Feats function to concatenate semantic and coarse acoustic codes for vocals with optional additive noise
We experiment with two noise conditions: Clean and Noisy
As a default for the target, we concatenate semantic and coarse acoustic codes for the instrumental with no noise
We show that different input vocal featurizations produce different generalization properties
We experiment with additional featurizations for the input and target
Our best model adds noise to vocal inputs, uses only semantic codes for vocals as conditioning info, and uses both semantic and coarse acoustic codes for target instrumentals
Scaling up improves quantitative performance compared to smaller scale

Listening study

Conducted a listening study to measure performance of two models
Listeners presented with two 10s vocal-instrumental mixtures
Vocal identical between mixtures, from MUSDB18-test
Instrumentals from different sources (ground truth, models, baselines)
Listeners asked to indicate which mixture has more musically compatible instrumental accompaniments
Listeners discouraged from paying attention to audio fidelity of instrumental

Baselines

Random baseline retrieves instrumental clip uniformly at random from MUSDB18-dev
Retrieval baseline uses musical features of ground truth mixture to retrieve instrumental from MUSDB18-dev and adapt it
Retrieval baseline similar to Songsmith but retrieves instrumental audio instead of MIDI
Key and tempo features computed using madmom library
For input vocal query from MUSDB18-test, key probabilities estimated and tempo detected from ground truth instrumental
Instrumental track selected from MUSDB18-dev with lowest Euclidean distance in key probability space
Instrumental time stretched to match estimated tempo of input
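
The retrieval steps above can be sketched as a nearest-neighbour search. This is an illustration under stated assumptions: the key probabilities and tempi are taken as already computed (no feature-extraction library is called here), the candidate fields `name`, `key_probs`, and `tempo` are hypothetical, the two-dimensional key vectors are shortened for readability, and treating the 0.5 to 2.0 tempo-ratio bound as a filter on candidates is one reading of the baseline description.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def retrieve(query_key_probs, query_tempo, candidates,
             min_ratio=0.5, max_ratio=2.0):
    """Return (candidate, stretch_ratio) for the nearest candidate in key space."""
    best = None
    for cand in candidates:
        ratio = query_tempo / cand["tempo"]  # speed-up factor for the candidate
        if not (min_ratio <= ratio <= max_ratio):
            continue  # would need too extreme a time stretch
        dist = euclidean(query_key_probs, cand["key_probs"])
        if best is None or dist < best[0]:
            best = (dist, cand, ratio)
    if best is None:
        return None
    return best[1], best[2]

candidates = [
    {"name": "a", "key_probs": [0.9, 0.1], "tempo": 100.0},
    {"name": "b", "key_probs": [0.2, 0.8], "tempo": 120.0},
    {"name": "c", "key_probs": [0.1, 0.9], "tempo": 40.0},
]
match, stretch = retrieve([0.1, 0.9], 120.0, candidates)
```

In this example candidate "c" is the closest in key space but is skipped because matching its tempo would require a threefold speed-up, so "b" is returned instead.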

Results

Listeners preferred instrumentals from SingSong-XL 66% of the time compared to the strongest baseline.
Wilcoxon signed-rank test showed that instrumentals from SingSong-Base and XL were preferred significantly more often than the strongest baseline.
Listeners preferred instrumentals from SingSong-XL 56% of the time compared to SingSong-Base.
Listeners preferred instrumentals from SingSong-XL 57% of the time compared to any other source including the ground truth.

Discussion

SingSong produces instrumentals with strong qualitative performance relative to a strong baseline.
Instrumentals have clear harmonic and temporal correspondence to the input vocals.
Instrumentals have weaker harmonic elements compared to percussive elements.
Results on Vocadito dataset are promising.

A. Additional experimental results

FAD and NLL are compared for two experiments using isolated and source-separated vocals as input
NLL on isolated vocals decreases monotonically, while FAD on isolated vocals diverges in one experiment
Subjective opinions agree with FAD more often than with NLL
Quantitative evaluation and analysis is centered around FAD
NLL is reported on isolated and source-separated vocal inputs
Two additional experiments are reported under the Noisy/SA-SA condition
Removing filtering and using relative positional embeddings results in worse FAD

B. Additional listening study details

Participants in the listening study were asked to judge which instrumental track best fit the vocals.
Raters were asked to focus on this fit between vocals and instrumental, not on audio quality.

B.1. Additional retrieval baseline details

Kept ratio of estimated tempi of vocals and instrumentals between 0.5 and 2.0
Removed 4 tracks from MUSDB18-dev retrieval set that were shorter than 20 seconds
Adapted AudioLM to be suitable for training conditional “audio-to-audio” generative models of instrumentals given vocals
Added white noise to input to conceal residual artifacts of instrumental present in source-separated vocals
Extracted semantic codes from pre-trained w2v-BERT model and coarse acoustic codes from pre-trained SoundStream codec
Used T5 to predict target codes given input codes
Experimented with different featurizations of input vocals to improve system’s ability to generalize
Used FAD and NLL to evaluate performance
