Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents SingSong, a system that generates instrumental music to accompany input vocals
  • Builds on recent developments in musical source separation and audio generation
  • Applies a state-of-the-art source separation algorithm to a large corpus of music audio
  • Adapts AudioLM for conditional “audio-to-audio” generation tasks
  • Listeners expressed a preference for instrumentals generated by SingSong compared to a retrieval baseline

Paper Content

Introduction

  • Microsoft Songsmith extracts pitch information from input vocals and predicts a sequence of symbolic chord labels
  • Lattner & Grachten (2019) and Grachten et al. (2020) generate symbolic kick drum and bass accompaniments
  • Wu et al. (2022) generate audio of drum tracks given music audio without drums
  • Task of generating entire accompaniment track based solely on vocal performance
  • Body of work around symbolic harmonization, predicting chord labels for symbolic melody input
  • Agostinelli et al. (2023) propose MusicLM to generate music audio conditioned on input text descriptions

Task definition and methods

  • Task of vocal accompaniment is a conditional generative modeling problem
  • Goal is to model a distribution of appropriate instrumental waveforms for vocal waveforms
  • Both waveforms are monaural, T seconds in length, and sampled at some rate f_s
  • Outputs are generated by sampling an instrumental waveform and linearly mixing it with the input vocals
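
The final mixing step can be illustrated with a minimal sketch. It assumes both waveforms are plain Python lists of float samples; the function names, the gain parameters, and the 16 kHz rate used in the usage note are hypothetical and are not taken from the paper.

```python
def num_samples(duration_s, sample_rate_hz):
    """Number of samples in a T-second mono waveform sampled at rate f_s."""
    return int(round(duration_s * sample_rate_hz))


def mix(vocals, instrumental, vocal_gain=1.0, inst_gain=1.0):
    """Linearly mix two mono waveforms of equal length, sample by sample.

    The gains are illustrative knobs; with the defaults this is a plain sum.
    """
    if len(vocals) != len(instrumental):
        raise ValueError("waveforms must have the same number of samples")
    return [vocal_gain * v + inst_gain * i for v, i in zip(vocals, instrumental)]
```

For example, a 10 s clip at an assumed 16 kHz has `num_samples(10, 16000)` samples, and `mix(vocals, sampled_instrumental)` yields the output mixture.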

Modeling proxy distributions of audio codes

  • Audio requires high sampling rates to be represented
  • Modeling audio distributions is challenging
  • Model discrete audio codes instead of waveforms
  • Use a discrete codec with two functions: Enc and Dec
  • Model a proxy distribution over codes produced by Enc
  • Approximate the distribution over waveforms by leveraging Dec
  • Sampling audio from this distribution involves sampling codes from the proxy distribution and decoding them with Dec
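
The Enc/Dec idea can be sketched with a toy codec. This is not SoundStream: it is a uniform scalar quantizer chosen only to show the pattern of sampling codes from a proxy distribution and then decoding them. `NUM_LEVELS`, `uniform_proxy`, and all function names are hypothetical.

```python
import random

NUM_LEVELS = 256  # hypothetical codebook size for this toy codec


def enc(waveform):
    """Map samples in [-1, 1] to integer codes in [0, NUM_LEVELS - 1]."""
    codes = []
    for x in waveform:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        codes.append(min(NUM_LEVELS - 1, int((x + 1.0) / 2.0 * NUM_LEVELS)))
    return codes


def dec(codes):
    """Map each code back to the centre of its quantization bin."""
    return [(c + 0.5) / NUM_LEVELS * 2.0 - 1.0 for c in codes]


def uniform_proxy(num_codes, rng):
    """Stand-in for a learned model: draws codes uniformly at random."""
    return [rng.randrange(NUM_LEVELS) for _ in range(num_codes)]


def sample_audio(sample_codes, num_codes, rng):
    """Sample codes from a proxy distribution, then decode them to audio."""
    return dec(sample_codes(num_codes, rng))
```

In a real system the uniform stand-in would be replaced by a trained sequence model over codes, and `dec` by the codec's learned decoder.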

AudioLM preliminaries

  • AudioLM is a state-of-the-art unconditional generative model of audio
  • AudioLM uses a factorized approach to model a proxy distribution over multiple types of audio codes
  • SoundStream is a pretrained model with an encoder and decoder
  • AudioLM “flattens” the coarse and fine codes, inducing rates of 200 and 400 codes per second of audio, respectively
  • Jukebox is a past music audio generative model that directly modeled representations similar to acoustic codes
  • AudioLM proposes to use smaller models to model a joint distribution over acoustic codes and low-rate semantic codes
  • AudioLM factorizes the joint distribution over semantic and acoustic codes
  • AudioLM is a cascade of three models that generate increasingly high-rate codes
  • AudioLM is adapted to model proxy distributions over instrumental codes given vocal codes
  • A three-stage factorization is used to generate audio with AudioLM
  • T5 is used to predict target codes given input codes
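
The rates of 200 and 400 codes per second follow from flattening per-frame code vectors into one sequence. The sketch below assumes a codec that emits 50 frames per second with 12 quantizer levels per frame, the first 4 treated as coarse and the remaining 8 as fine. These three constants are assumptions chosen to be consistent with the stated rates; they are not given in the summary above.

```python
FRAME_RATE_HZ = 50      # assumed codec frame rate
NUM_COARSE_LEVELS = 4   # assumed number of coarse quantizer levels per frame
NUM_FINE_LEVELS = 8     # assumed number of fine quantizer levels per frame


def split_coarse_fine(frames):
    """Split each frame's codes into coarse (first levels) and fine (the rest)."""
    coarse = [frame[:NUM_COARSE_LEVELS] for frame in frames]
    fine = [frame[NUM_COARSE_LEVELS:] for frame in frames]
    return coarse, fine


def flatten(frames):
    """Flatten per-frame code vectors into one sequence, frame by frame."""
    return [code for frame in frames for code in frame]


def codes_per_second(num_levels, frame_rate_hz=FRAME_RATE_HZ):
    """Sequence rate after flattening: one code per level per frame."""
    return num_levels * frame_rate_hz
```

Under these assumptions one second of audio flattens to 4 x 50 = 200 coarse codes and 8 x 50 = 400 fine codes.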

Featurizing input vocals

  • We use a function called Feats to extract discrete features from vocal inputs.
  • We add white noise to vocal inputs during training and inference.
  • The purpose of adding noise is to conceal artifacts of the original instrumental that remain in source-separated vocals.
  • We explore a range of featurizations including additive noise and different combinations of semantic and acoustic codes.
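
One way to add white noise to a vocal waveform is sketched below. The noise level is expressed here as a target signal-to-noise ratio, which is an assumption made for illustration: the summary does not state how the noise level is set, and `add_white_noise` and `snr_db` are hypothetical names.

```python
import math
import random


def add_white_noise(vocals, snr_db, rng):
    """Add Gaussian white noise to a mono waveform at a target SNR in dB.

    The SNR parameterization is an assumption of this sketch.
    """
    if not vocals:
        return []
    signal_power = sum(x * x for x in vocals) / len(vocals)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in vocals]
```

The same function would be applied both at training time (to source-separated vocals) and at inference time (to isolated vocals), so that the model sees similarly noised inputs in both settings.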

Experiments and results

  • Training SingSong
  • Evaluating SingSong

Datasets

  • Training set for SingSong is 1 million audio-only sources, 46k hours of music
  • Preprocessed by resampling, averaging stereo to mono
  • Preprocessing differs for pre-training SoundStream and w2v-BERT vs. training SingSong
  • Extract non-overlapping 10s clips from each mix and input to MDXNet for source separation
  • MUSDB18 dataset used for evaluation, 1232 clips for training set, 778 clips for test set
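
Two of the preprocessing steps, averaging stereo to mono and cutting non-overlapping 10 s clips, can be sketched as follows. Resampling and the MDXNet separation call are omitted. Dropping a trailing partial clip is an assumption of this sketch, and the function names are hypothetical.

```python
def stereo_to_mono(left, right):
    """Average the left and right channels sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def non_overlapping_clips(waveform, sample_rate_hz, clip_seconds=10):
    """Cut a waveform into consecutive fixed-length clips with no overlap.

    A trailing remainder shorter than one clip is dropped (an assumption).
    """
    clip_len = int(clip_seconds * sample_rate_hz)
    num_clips = len(waveform) // clip_len
    return [waveform[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]
```

Each resulting clip would then be passed to the source separation model to obtain a (vocals, instrumental) training pair.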

Filtering training data

  • Filter out clips where instrumental is silent or vocals are louder than instrumental
  • Concretely, filter out clips where the peak RMS amplitude of the instrumental is below -25 dB or the vocals are at least 5 dB louder than the instrumental
  • Goal of filtering is to bias system towards always outputting some instrumental for all inputs
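
The two filtering rules can be sketched as a single predicate. The -25 dB and 5 dB thresholds come from the text above. The window length, the use of dB relative to full scale, and the choice to compare vocals and instrumental by their peak RMS levels are assumptions of this sketch; `peak_rms_db` and `keep_clip` are hypothetical names.

```python
import math


def peak_rms_db(waveform, window=1024):
    """Largest windowed RMS level of a waveform, in dB relative to full scale."""
    peak = 0.0
    for start in range(0, len(waveform), window):
        chunk = waveform[start:start + window]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        peak = max(peak, rms)
    return 20.0 * math.log10(peak) if peak > 0.0 else float("-inf")


def keep_clip(vocals, instrumental, min_inst_db=-25.0, max_vocal_excess_db=5.0):
    """Return True if a (vocals, instrumental) pair passes both filters."""
    inst_db = peak_rms_db(instrumental)
    if inst_db < min_inst_db:
        return False  # instrumental is silent or nearly silent
    vocal_db = peak_rms_db(vocals)
    return vocal_db - inst_db < max_vocal_excess_db  # vocals not much louder
```

Discarding pairs with a silent or very quiet instrumental is what biases the trained system towards always producing some accompaniment.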

Evaluation

  • Adopted Fréchet Audio Distance (FAD) metric to evaluate audio quality
  • FAD is the audio analogue to FID metric used for image generation models
  • Compute FAD on MUSDB18 using ground truth mixes as reference audio
  • Compute FAD on isolated and source-separated vocals
  • Compute negative log-likelihood (NLL) of models over coarse acoustic codes from isolated instrumentals
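
The Fréchet distance underlying FAD compares two Gaussians fitted to embedding vectors of reference and generated audio. The sketch below is a simplified variant that assumes diagonal covariances so that it needs no matrix square root; the standard metric uses full covariance matrices and embeddings from a pretrained audio model. The function names are hypothetical.

```python
import math


def mean_and_var(embeddings):
    """Per-dimension mean and population variance of a list of vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    var = [sum((e[d] - mean[d]) ** 2 for e in embeddings) / n for d in range(dim)]
    return mean, var


def frechet_distance_diag(ref_embeddings, gen_embeddings):
    """Frechet distance between two Gaussians with diagonal covariances.

    For diagonal covariances the trace term reduces to a per-dimension sum:
    var_r + var_g - 2 * sqrt(var_r * var_g).
    """
    mu_r, var_r = mean_and_var(ref_embeddings)
    mu_g, var_g = mean_and_var(gen_embeddings)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(a + b - 2.0 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term
```

Lower values indicate that the generated audio's embedding statistics are closer to those of the reference set.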

Modeling hyperparameters

  • Adopted default architecture and training hyperparameters from t5.1.1.base configuration
  • Increased dropout from 0 to 0.1
  • Replaced relative positional embeddings with fixed positional encodings from vanilla Transformer
  • Used combination of Noisy and S-SA featurization to improve generalization
  • Trained models for 200k steps on 10s clips
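
The fixed positional encodings of the original Transformer are sinusoids whose wavelengths grow geometrically with the feature index. A minimal sketch follows; the interleaved sine/cosine layout is the one from the original Transformer formulation, and whether the authors' T5 setup uses exactly this layout is not stated in the summary. The function name is hypothetical.

```python
import math


def sinusoidal_positions(num_positions, d_model):
    """Fixed sinusoidal positional encodings, one row per position.

    Even feature indices hold sines and odd indices hold cosines; each
    sine/cosine pair shares the frequency 1 / 10000 ** (2i / d_model).
    """
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            angle = pos / (10000.0 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```

Because the table depends only on position and feature index, it is computed once and added to the token embeddings rather than learned.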

Audio featurization experiments

  • We explore different audio featurization configurations for input vocals and target instrumentals
  • We use Feats function to concatenate semantic and coarse acoustic codes for vocals with optional additive noise
  • We experiment with two noise conditions: Clean and Noisy
  • As a default for the target, we concatenate semantic and coarse acoustic codes for the instrumental with no noise
  • We show that different input vocal featurizations produce different generalization properties
  • We experiment with additional featurizations for the input and target
  • Our best model adds noise to vocal inputs, uses only semantic codes for vocals as conditioning info, and uses both semantic and coarse acoustic codes for target instrumentals
  • Scaling up improves quantitative performance compared to smaller scale
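
The featurization options can be sketched as a function that builds one token sequence from semantic and coarse acoustic codes. The mode names "S" (semantic only) and "SA" (semantic plus coarse acoustic) mirror the labels used in this summary. Shifting acoustic codes by a vocabulary offset so the two code types do not collide is an assumption of this sketch, as are `SEMANTIC_VOCAB_SIZE` and the function name.

```python
SEMANTIC_VOCAB_SIZE = 1024  # hypothetical size of the semantic codebook


def feats(semantic_codes, coarse_codes, mode="SA"):
    """Build a conditioning or target token sequence from audio codes.

    mode "S":  semantic codes only.
    mode "SA": semantic codes followed by coarse acoustic codes, with the
               acoustic codes shifted into their own id range (an assumption).
    """
    if mode == "S":
        return list(semantic_codes)
    if mode == "SA":
        shifted = [c + SEMANTIC_VOCAB_SIZE for c in coarse_codes]
        return list(semantic_codes) + shifted
    raise ValueError("mode must be 'S' or 'SA'")
```

In this picture the best configuration described above corresponds to `feats(..., mode="S")` on codes extracted from noised vocals for the input, and `feats(..., mode="SA")` on clean instrumental codes for the target.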

Listening study

  • Conducted a listening study to measure performance of two models
  • Listeners presented with two 10s vocal-instrumental mixtures
  • Vocal identical between mixtures, from MUSDB18-test
  • Instrumentals from different sources (ground truth, models, baselines)
  • Listeners asked to indicate which mixture has more musically compatible instrumental accompaniments
  • Listeners discouraged from paying attention to audio fidelity of instrumental

Baselines

  • Random baseline retrieves instrumental clip uniformly at random from MUSDB18-dev
  • Retrieval baseline uses musical features of ground truth mixture to retrieve instrumental from MUSDB18-dev and adapt it
  • Retrieval baseline similar to Songsmith but retrieves instrumental audio instead of MIDI
  • Key and tempo features computed using madmom library
  • For input vocal query from MUSDB18-test, key probabilities estimated and tempo detected from ground truth instrumental
  • Instrumental track selected from MUSDB18-dev with lowest Euclidean distance in key probability space
  • Instrumental time stretched to match estimated tempo of input
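
The selection and tempo-matching steps of the retrieval baseline can be sketched as follows. Key probabilities and tempi are assumed to have been estimated already (the text uses madmom for this; no madmom calls are shown here). The candidate dictionary layout and the function names are hypothetical, and the actual time stretching of audio is left out.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def retrieve(query_key_probs, candidates):
    """Return the candidate whose key probabilities are closest to the query.

    Each candidate is a dict with "name", "key_probs", and "tempo_bpm".
    """
    return min(candidates, key=lambda c: euclidean(query_key_probs, c["key_probs"]))


def stretch_rate(query_tempo_bpm, candidate_tempo_bpm):
    """Playback-rate factor that moves the candidate's tempo to the query's."""
    return query_tempo_bpm / candidate_tempo_bpm
```

For example, a retrieved instrumental at 100 BPM matched to a 120 BPM query would be played back 1.2 times faster.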

Results

  • Listeners preferred instrumentals from SingSong-XL 66% of the time compared to the strongest baseline.
  • Wilcoxon signed-rank test showed that instrumentals from SingSong-Base and XL were preferred significantly more often than the strongest baseline.
  • Listeners preferred instrumentals from SingSong-XL 56% of the time compared to SingSong-Base.
  • Listeners preferred instrumentals from SingSong-XL 57% of the time compared to any other source including the ground truth.

Discussion

  • SingSong produces instrumentals with strong qualitative performance relative to a strong baseline.
  • Instrumentals have clear harmonic and temporal correspondence to the input vocals.
  • Instrumentals have weaker harmonic elements compared to percussive elements.
  • Results on Vocadito dataset are promising.

A. Additional experimental results

  • FAD and NLL are compared for two experiments using isolated and source-separated vocals as input
  • NLL on isolated vocals decreases monotonically, while FAD on isolated vocals diverges in one experiment
  • Subjective opinions agree with FAD more often than with NLL
  • Quantitative evaluation and analysis is centered around FAD
  • NLL is reported on isolated and source-separated vocal inputs
  • Two additional experiments are reported under the Noisy/SA-SA condition
  • Removing filtering and using relative positional embeddings results in worse FAD

B. Additional listening study details

  • Participants in the listening study were asked to judge which instrumental track best fit the vocals.
  • Raters were told to base this judgment on the fit between vocals and instrumental, not on audio quality.

B.1. Additional retrieval baseline details

  • Kept ratio of estimated tempi of vocals and instrumentals between 0.5 and 2.0
  • Removed 4 tracks from MUSDB18-dev retrieval set that were shorter than 20 seconds

Summary

  • Adapted AudioLM to be suitable for training conditional “audio-to-audio” generative models of instrumentals given vocals
  • Added white noise to input to conceal residual artifacts of instrumental present in source-separated vocals
  • Extracted semantic codes from pre-trained w2v-BERT model and coarse acoustic codes from pre-trained SoundStream codec
  • Used T5 to predict target codes given input codes
  • Experimented with different featurizations of input vocals to improve system’s ability to generalize
  • Used FAD and NLL to evaluate performance