Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • This paper deals with the problem of audio source separation.
  • Deep neural networks are used to obtain instrumental spectra from a mixture.
  • A novel network architecture is proposed that extends the recently developed densely connected convolutional network (DenseNet).
  • An up-sampling layer, block skip connection and band-dedicated dense blocks are incorporated on top of DenseNet.
  • The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results.
  • The proposed architecture requires fewer parameters and less training time compared with other methods.

Paper Content

Introduction

  • Audio source separation has attracted attention in the last decade
  • Various approaches have been introduced, including local Gaussian modeling, non-negative factorization, kernel additive modeling, and combinations of those approaches
  • Recently, deep neural networks (DNNs) based source separation methods have shown significant improvement
  • Standard feed-forward fully connected networks (FNN) and long short-term memory (LSTM) networks have been used to obtain source spectra
  • Convolutional Neural Network (CNN) has been used in audio and video tasks
  • ResNets and Highway Networks address the problem of training deep networks
  • Densely connected convolutional networks (DenseNet) have shown excellent performance on image recognition tasks
  • DenseNet is proposed for audio source separation to take advantage of long contexts and fine-grained structures
  • Dense blocks are proposed dedicated to particular frequency bands
  • Proposed method outperforms state of the art and reduces training time and number of parameters

Multi-scale multi-band DenseNet

  • Summarize DenseNet architecture
  • Introduce up-scaling blocks and inter block skip connections to deal with high dimensional inputs and outputs
  • Introduce multi-band DenseNet architecture to improve modeling efficiency and capability
  • Outline complete architectures

DenseNet

  • Standard feed forward networks use non-linear transformations to compute output
  • ResNet uses a skip connection to make training of deep architectures easier
  • DenseNet replaces the simple addition of outputs with concatenation: each layer receives the feature maps of all preceding layers as input
  • Each DenseNet layer applies batch normalization (BN), ReLU and a convolution that produces k feature maps
  • Down-sampling layer is used to capture global information efficiently
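
The dense connectivity described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the composite layer is reduced to ReLU plus a 1x1 convolution (BN and the spatial kernel are omitted), and the names `composite_layer`, `dense_block` and `growth_rate` are hypothetical.

```python
import numpy as np

def composite_layer(x, weight):
    """Simplified stand-in for one DenseNet layer: ReLU then a 1x1 convolution.

    x: (channels_in, freq, time); weight: (k, channels_in).
    """
    activated = np.maximum(x, 0.0)
    # A 1x1 convolution is a linear map over the channel axis at every position.
    return np.einsum("oc,cft->oft", weight, activated)

def dense_block(x0, num_layers, growth_rate, rng):
    """Each layer sees the concatenation of the block input and all earlier outputs."""
    features = [x0]
    for _ in range(num_layers):
        stacked = np.concatenate(features, axis=0)
        # Random weights stand in for trained parameters (assumption for the demo).
        weight = rng.standard_normal((growth_rate, stacked.shape[0])) * 0.1
        features.append(composite_layer(stacked, weight))
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 16, 8))  # 4 input maps, 16 frequency bins, 8 frames
out = dense_block(x0, num_layers=3, growth_rate=2, rng=rng)
print(out.shape)  # (10, 16, 8): 4 input maps + 3 layers * 2 new maps each
```

Because every layer only adds k new maps, the channel count grows linearly with depth while earlier features stay directly accessible to later layers.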

Multi-scale DenseNet with block skip connection and transposed convolution

  • Dense blocks and down-sampling layers make up the downsampling path of the proposed multi-scale DenseNet.
  • An upsampling layer is introduced to recover the original resolution from lower resolution feature maps.
  • Inter-block skip connections are introduced to allow forward and backward signal flow without passing through lower resolution blocks.
  • MDenseNet is a fully convolutional architecture that can be applied to arbitrary input length.
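
A two-scale version of the structure above can be sketched as follows. The sketch assumes 2x2 average pooling for down-sampling and a stride-2 transposed convolution with a single 2x2 kernel shared by all channels for up-sampling; a real layer would learn its kernels and mix channels, and the dense blocks between the steps are left out. All function names are hypothetical.

```python
import numpy as np

def avg_pool_2x2(x):
    """2x2 average pooling over (freq, time) for x of shape (channels, freq, time)."""
    c, f, t = x.shape
    return x.reshape(c, f // 2, 2, t // 2, 2).mean(axis=(2, 4))

def transposed_conv_2x2(x, kernel):
    """Stride-2 transposed convolution applied per channel.

    out[c, 2i + a, 2j + b] = x[c, i, j] * kernel[a, b]
    """
    return np.stack([np.kron(channel, kernel) for channel in x])

def two_scale_block(x, kernel):
    low = avg_pool_2x2(x)                  # down-sampling path (dense block omitted)
    up = transposed_conv_2x2(low, kernel)  # up-sampling path back to input resolution
    # Inter-block skip connection: same-resolution maps bypass the low-resolution path.
    return np.concatenate([up, x], axis=0)

kernel = np.ones((2, 2))  # hypothetical fixed kernel; a trained layer would learn it
for frames in (6, 20):    # fully convolutional: the same code handles both lengths
    x = np.arange(3 * 8 * frames, dtype=float).reshape(3, 8, frames)
    y = two_scale_block(x, kernel)
    print(x.shape, "->", y.shape)
```

The sketch requires even frequency and time dimensions so that pooling and up-sampling restore the original size; no other constraint ties it to a particular input length.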

Multi-band MDenseNet

  • Convolution layer kernels are shared across the entire input field.
  • Limited kernel sharing is more suitable for efficiently capturing local patterns in audio.
  • Split input into multiple bands and apply multiscale DenseNet to each band.
  • Concatenate output from full band MDenseNet with outputs from multiple sub-band MDenseNets.
  • Full band MDenseNet can focus on modeling rough global structure.
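
The band-splitting scheme above can be illustrated with stand-in networks. In this sketch each "network" is just a scaling function with its own parameter, which is enough to show that kernels are not shared between bands; the band edges, `make_net` and `multi_band_forward` are hypothetical, and joining the band outputs along frequency before stacking them with the full-band output along channels is an assumption about the layout.

```python
import numpy as np

def split_bands(spec, edges):
    """Split (channels, freq, time) along the frequency axis at the given bin edges."""
    return [spec[:, lo:hi, :] for lo, hi in zip(edges[:-1], edges[1:])]

def make_net(scale):
    """Placeholder for a band-dedicated MDenseNet with its own parameters."""
    return lambda x: scale * x

def multi_band_forward(spec, edges, band_nets, full_net):
    bands = split_bands(spec, edges)
    band_outputs = [net(band) for net, band in zip(band_nets, bands)]
    # Join the band-limited outputs back into a full frequency axis.
    banded = np.concatenate(band_outputs, axis=1)
    # The full-band network models the rough global structure.
    full = full_net(spec)
    # Stack both views along the channel axis for a later integrating layer.
    return np.concatenate([banded, full], axis=0)

spec = np.arange(32, dtype=float).reshape(1, 8, 4)  # 1 channel, 8 bins, 4 frames
edges = [0, 4, 8]                                    # low band: bins 0-3, high band: bins 4-7
out = multi_band_forward(spec, edges, [make_net(2.0), make_net(3.0)], make_net(1.0))
print(out.shape)  # (2, 8, 4)
```

Replacing `make_net` with differently sized networks per band is where the per-band allocation of capacity would come in.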

Architecture details

  • Proposed network architectures for audio source separation are described in Table 1
  • MMDenseNet allows for individual design of each band and assigning of computational resources according to importance of each band

Experiments

Setup

  • Evaluated proposed method on DSD100 dataset
  • Dataset consists of 50 songs each in Dev and Test sets
  • Task is to separate songs either into 4 sources (bass, drums, other, vocals) or into vocals and accompaniment
  • Used spectrogram of mixture as input and trained network to estimate target spectrogram
  • Compared method with other state-of-the-art approaches
  • MDenseNet performed as well as BLSTM
  • MMDenseNet significantly improved performance and outperformed all baselines
  • MMDenseNet+ further improved performance and showed the best overall result
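
The input and target representation from the setup above can be sketched with a plain NumPy STFT. The frame length, hop size, sample rate and synthetic sine-wave "sources" are placeholders chosen for the demo, and the mean-squared-error objective is an assumption, since this summary does not state the loss.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=1024, hop=512):
    """Magnitude STFT with a Hann window; returns an array of shape (freq_bins, frames)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

def mse(estimate, target):
    """Mean squared error between two spectrograms (assumed training objective)."""
    return float(np.mean((estimate - target) ** 2))

sample_rate = 16000                                   # placeholder value
t = np.arange(4096) / sample_rate
vocals = np.sin(2 * np.pi * 440.0 * t)                # synthetic stand-in for a source
accompaniment = 0.5 * np.sin(2 * np.pi * 1000.0 * t)  # synthetic stand-in for the rest
mixture = vocals + accompaniment

mix_spec = magnitude_spectrogram(mixture)     # network input
target_spec = magnitude_spectrogram(vocals)   # network target
print(mix_spec.shape)  # (513, 7)
print(mse(mix_spec, target_spec) > 0.0)  # True: an untrained identity map leaves an error
```

A separation network would be trained so that its output for `mix_spec` approaches `target_spec`.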

Architecture validation

  • Proposed multi-scale dense block enables network to model signal on different scales.
  • Validated whether dense blocks contribute to recovering target spectrogram by computing map-wise l2-norm of filter weights.
  • Comparing l2-norms of up-sampling path and skip connection path showed that every dense block at different scale contributes reasonably.
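
One way to compute such a map-wise l2-norm is sketched below. It assumes the weights belong to a convolution that consumes the concatenated up-sampled and skip-connection maps, with the up-sampled maps listed first; the tensor layout, channel counts and random weights are illustrative only and do not reproduce the paper's measurements.

```python
import numpy as np

def map_wise_l2_norm(weight):
    """L2 norm of the filter weights attached to each input feature map.

    weight: (out_channels, in_channels, kernel_h, kernel_w)
    returns: array of shape (in_channels,)
    """
    return np.sqrt((weight ** 2).sum(axis=(0, 2, 3)))

rng = np.random.default_rng(0)
num_up, num_skip = 4, 6  # hypothetical channel counts for the two paths
weight = rng.standard_normal((8, num_up + num_skip, 3, 3))  # stand-in for trained weights

norms = map_wise_l2_norm(weight)
up_norms, skip_norms = norms[:num_up], norms[num_up:]
print("up-sampling path mean norm:", up_norms.mean())
print("skip connection mean norm:", skip_norms.mean())
```

If one path were ignored by the trained layer, its norms would be close to zero; comparable norms on both paths indicate that both scales are used.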

Model efficiency

  • Proposed architecture encourages feature reuse, leading to a compact and efficient model
  • Number of parameters of proposed architectures is significantly less than baseline methods
  • MDenseNet achieved comparable performance to state-of-the-art BLEND with only 1.5% of the parameters
  • MMDenseNet outperformed BLEND with only 3.6% of the parameters
  • Training time is significantly less than for BLSTM and BLEND methods
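
The effect of feature reuse on model size can be illustrated with simple arithmetic. The sketch counts convolution weights only (biases and BN parameters are ignored) and compares a dense block with a plain stack of layers that ends with the same number of feature maps; the channel counts are hypothetical and are not the configuration from the paper.

```python
def dense_block_params(in_channels, num_layers, growth_rate, kernel_size=3):
    """Convolution weights in a dense block where each layer adds `growth_rate` maps."""
    total = 0
    channels = in_channels
    for _ in range(num_layers):
        total += channels * growth_rate * kernel_size * kernel_size
        channels += growth_rate  # the new maps are concatenated to the input
    return total

def plain_block_params(in_channels, num_layers, width, kernel_size=3):
    """Convolution weights in a plain stack where every layer outputs `width` maps."""
    total = 0
    channels = in_channels
    for _ in range(num_layers):
        total += channels * width * kernel_size * kernel_size
        channels = width  # the output replaces the input
    return total

dense = dense_block_params(in_channels=32, num_layers=4, growth_rate=12)
plain = plain_block_params(in_channels=32, num_layers=4, width=80)  # 80 = 32 + 4 * 12
print(dense)  # 21600
print(plain)  # 195840
print(round(dense / plain, 3))  # 0.11
```

Both variants expose 80 feature maps at their output, but the dense block only has to learn 12 new maps per layer.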

Conclusion

  • Extended DenseNet to tackle audio source separation
  • Proposed architectures have dense blocks at multiple scales connected through down-sampling and up-sampling layers
  • Proposed multi-band DenseNet to enable kernels in convolution layer to learn more effectively
  • Outperformed state-of-the-art by a large margin while reducing model size and training time