Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

This paper deals with the problem of audio source separation.
Deep neural networks are used to obtain instrumental spectra from a mixture.
A novel network architecture is proposed that extends the recently developed densely connected convolutional network (DenseNet).
An up-sampling layer, block skip connection and band-dedicated dense blocks are incorporated on top of DenseNet.
The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results.
The proposed architecture requires fewer parameters and less training time compared with other methods.

Paper Content

Introduction

Audio source separation has attracted attention in the last decade
Various approaches have been introduced, including local Gaussian modeling, non-negative factorization, kernel additive modeling, and combinations of those approaches
Recently, source separation methods based on deep neural networks (DNNs) have shown significant improvement
Standard feed-forward fully connected networks (FNNs) and long short-term memory (LSTM) networks have been used to obtain source spectra
Convolutional neural networks (CNNs) have been used in audio and video tasks
ResNets and Highway Networks address the problem of training deep networks
Densely connected convolutional networks (DenseNet) have shown excellent performance on image recognition tasks
The paper applies DenseNet to audio source separation to take advantage of long contexts and fine-grained structures
Dense blocks dedicated to particular frequency bands are proposed
The proposed method outperforms the state of the art while reducing training time and the number of parameters

Multi-scale multi-band DenseNet

The section first summarizes the DenseNet architecture
It then introduces up-sampling blocks and inter-block skip connections to deal with high-dimensional inputs and outputs
Next, it introduces the multi-band DenseNet architecture to improve modeling efficiency and capability
Finally, it outlines the complete architectures

DenseNet

A standard feed-forward network computes the output of each layer by applying a non-linear transformation to the output of the previous layer
ResNet adds a skip connection around the transformation to make training of deep architectures easier
DenseNet replaces the simple addition of outputs with concatenation of the feature maps of all preceding layers
Each DenseNet layer applies batch normalization (BN), ReLU, and a convolution that produces k feature maps
A down-sampling layer is used to capture global information efficiently
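
The concatenation pattern described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: each feature map is reduced to a single number, and `toy_layer` is a hypothetical stand-in for the BN, ReLU, and convolution sequence, not the layer used in the paper. The point it shows is that every layer reads all earlier feature maps and appends k new ones, so a block with L layers grows the channel count by L times k.

```python
def dense_block(x, num_layers, growth_rate, layer_fn):
    """Run a dense block over `x`, a list of feature maps (channels).

    Each layer sees the concatenation of the block input and all earlier
    layer outputs, and contributes `growth_rate` new feature maps.
    """
    features = list(x)  # running concatenation along the channel axis
    for _ in range(num_layers):
        new_maps = layer_fn(features, growth_rate)
        assert len(new_maps) == growth_rate
        features = features + new_maps  # concatenate instead of adding
    return features


def toy_layer(features, k):
    """Hypothetical stand-in for BN -> ReLU -> convolution producing k maps."""
    mean = sum(features) / len(features)
    activated = max(mean, 0.0)  # ReLU
    return [activated * (i + 1) for i in range(k)]


block_input = [1.0, -2.0, 4.0]  # three input feature maps (scalars here)
block_output = dense_block(block_input, num_layers=4, growth_rate=2,
                           layer_fn=toy_layer)
num_channels = len(block_output)  # 3 + 4 * 2
```

Because the block input is carried through unchanged, later layers can reuse early features directly instead of relearning them.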

Multi-scale DenseNet with block skip connection and transposed convolution

Dense blocks and down-sampling layers make up the down-sampling path of the proposed multi-scale DenseNet.
An up-sampling layer is introduced to recover the original resolution from lower-resolution feature maps.
Inter-block skip connections are introduced to allow forward and backward signal flow without passing through lower-resolution blocks.
MDenseNet is a fully convolutional architecture that can be applied to arbitrary input length.
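
To make the data flow concrete, the following sketch traces feature-map shapes through a down-sampling path, a lowest-resolution block, and an up-sampling path with inter-block skip connections. It is a bookkeeping exercise, not the paper's configuration: the function name, the factor-of-two resampling, the rule that a dense block adds `num_layers * growth_rate` channels, and all sizes in the example call are assumptions chosen for illustration.

```python
def trace_mdensenet_shapes(freq, frames, channels, num_scales,
                           num_layers, growth_rate):
    """Return (stage, freq, frames, channels) after each dense block."""
    factor = 2 ** (num_scales - 1)
    if freq % factor or frames % factor:
        raise ValueError("input size must be divisible by 2**(num_scales - 1)")
    trace, skips = [], []
    # Down-sampling path: dense block, remember its output, halve resolution.
    for scale in range(num_scales - 1):
        channels += num_layers * growth_rate
        trace.append((f"down{scale}", freq, frames, channels))
        skips.append(channels)
        freq, frames = freq // 2, frames // 2
    # Lowest-resolution dense block.
    channels += num_layers * growth_rate
    trace.append(("bottom", freq, frames, channels))
    # Up-sampling path: double resolution, concatenate the skip, dense block.
    for scale in reversed(range(num_scales - 1)):
        freq, frames = freq * 2, frames * 2
        channels += skips[scale]             # inter-block skip connection
        channels += num_layers * growth_rate
        trace.append((f"up{scale}", freq, frames, channels))
    return trace


shapes = trace_mdensenet_shapes(freq=512, frames=128, channels=32,
                                num_scales=3, num_layers=3, growth_rate=12)
```

The last entry has the same frequency and time resolution as the input, and nothing in the trace depends on a fixed number of frames, which mirrors the fully convolutional property noted above.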

Multi-band MDenseNet

Convolution layer kernels are shared across the entire input field.
Limited kernel sharing is more suitable for efficiently capturing local patterns in audio.
The input is split into multiple bands, and a multi-scale DenseNet is applied to each band.
The outputs of the sub-band MDenseNets are concatenated with the output of a full-band MDenseNet.
The full-band MDenseNet can focus on modeling the rough global structure.
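
A minimal sketch of the band-splitting idea follows. The spectrogram is a list of frequency rows, and the per-band and full-band networks are passed in as plain callables. The names `split_bands` and `multi_band_forward`, the band edges, and the way the two paths are combined (sub-band outputs re-stacked along frequency, then paired with the full-band output as separate channels) are assumptions made for this illustration.

```python
def split_bands(spectrogram, edges):
    """Split a spectrogram (list of frequency rows) at the given bin edges."""
    bounds = [0] + list(edges) + [len(spectrogram)]
    return [spectrogram[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


def multi_band_forward(spectrogram, edges, band_nets, full_net):
    """Apply one network per band plus one full-band network."""
    bands = split_bands(spectrogram, edges)
    band_outputs = [net(band) for net, band in zip(band_nets, bands)]
    # Re-stack sub-band outputs along frequency to cover the full band again.
    stacked = [row for out in band_outputs for row in out]
    full_output = full_net(spectrogram)
    # Concatenate along the channel axis: band path and full-band path.
    return [stacked, full_output]


def scale(factor):
    """Hypothetical stand-in network that scales every bin by `factor`."""
    return lambda band: [[factor * v for v in row] for row in band]


spec = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 frequency bins x 2 frames
channels = multi_band_forward(spec, edges=[2],
                              band_nets=[scale(10), scale(100)],
                              full_net=scale(1))
```

Each band network only ever sees its own frequency range, so its kernels are not shared with the other bands.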

Architecture details

Proposed network architectures for audio source separation are described in Table 1
MMDenseNet allows each band to be designed individually, so computational resources can be assigned according to the importance of each band

Experiments

Setup

The proposed method was evaluated on the DSD100 dataset
The dataset consists of Dev and Test sets with 50 songs each
The task is to separate each song into four sources (bass, drums, other, and vocals) or into vocals and accompaniment
The spectrogram of the mixture was used as input, and the network was trained to estimate the target spectrogram
The method was compared with other state-of-the-art approaches
MDenseNet performed as well as BLSTM
MMDenseNet significantly improved performance and outperformed all baselines
MMDenseNet+ further improved performance and showed the best overall result

Architecture validation

The proposed multi-scale dense blocks enable the network to model the signal on different scales.
To validate whether the dense blocks at each scale contribute to recovering the target spectrogram, the map-wise l2-norm of the filter weights was computed.
Comparing the l2-norms of the up-sampling path and the skip connection path showed that the dense blocks at every scale contribute reasonably.
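
As an illustration of this kind of check, the sketch below computes one l2-norm per input feature map of a convolution layer. The weight layout `weights[out][in][row][col]` and the toy values are assumptions. A map whose norm is close to zero is effectively ignored by the layer, so comparing the norms of maps arriving from different paths indicates how much each path is used.

```python
import math


def mapwise_l2_norm(weights):
    """Return one l2-norm per input feature map.

    `weights[out][in][row][col]` holds the filter weights of one
    convolution layer (layout assumed for this sketch).
    """
    num_in = len(weights[0])
    norms = []
    for i in range(num_in):
        total = sum(w * w
                    for out_filter in weights
                    for row in out_filter[i]
                    for w in row)
        norms.append(math.sqrt(total))
    return norms


# Toy layer: 2 output maps, 2 input maps, 1x2 kernels.
toy_weights = [
    [[[3.0, 0.0]], [[0.0, 0.0]]],
    [[[0.0, 4.0]], [[1.0, 0.0]]],
]
norms = mapwise_l2_norm(toy_weights)
```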

Model efficiency

The proposed architecture encourages feature reuse, leading to a compact and efficient model
The number of parameters of the proposed architectures is significantly smaller than that of the baseline methods
MDenseNet achieved performance comparable to the state-of-the-art BLEND with only 1.5% of the parameters
MMDenseNet outperformed BLEND with only 3.6% of the parameters
Training time is significantly shorter than for the BLSTM and BLEND methods
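
One way to see why dense blocks stay compact is to count convolution weights. Each layer produces only k output maps, so its filter bank is narrow even when many earlier maps are fed in. The sketch below counts weights and biases for the convolutions in a single dense block and ignores batch-normalization parameters. The helper names and the sizes in the example call are hypothetical and are not taken from the paper's tables.

```python
def conv_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Parameters of one 2-D convolution with a square kernel."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def dense_block_params(in_channels, num_layers, growth_rate, kernel_size=3):
    """Convolution parameters of a dense block (BN parameters ignored)."""
    total = 0
    channels = in_channels
    for _ in range(num_layers):
        total += conv_params(channels, growth_rate, kernel_size)
        channels += growth_rate  # the next layer also sees the new maps
    return total


example_total = dense_block_params(in_channels=32, num_layers=3,
                                   growth_rate=12)
```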

Conclusion

The paper extends DenseNet to tackle audio source separation
The proposed architectures have dense blocks at multiple scales connected through down-sampling and up-sampling layers
A multi-band DenseNet is proposed to enable the kernels in each convolution layer to learn more effectively
The proposed method outperformed the state of the art by a large margin while reducing model size and training time
