Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- This paper deals with the problem of audio source separation.
- Deep neural networks are used to obtain instrumental spectra from a mixture.
- A novel network architecture is proposed that extends the recently developed densely connected convolutional network (DenseNet).
- An up-sampling layer, block skip connection and band-dedicated dense blocks are incorporated on top of DenseNet.
- The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results.
- The proposed architecture requires fewer parameters and less training time compared with other methods.
Paper Content
Introduction
- Audio source separation has attracted attention in the last decade
- Various approaches have been introduced, including local Gaussian modeling, non-negative factorization, kernel additive modeling, and combinations of those approaches
- Recently, deep neural networks (DNNs) based source separation methods have shown significant improvement
- Standard feed-forward fully connected network (FNN) and long short term memory (LSTM) have been used to obtain source spectra
- Convolutional Neural Network (CNN) has been used in audio and video tasks
- ResNets and Highway Networks address the problem of training deep networks
- Densely connected convolutional networks (DenseNet) has shown excellent performance on image recognition task
- DenseNet is proposed for audio source separation to take advantage of long contexts and fine-grained structures
- Dense blocks are proposed dedicated to particular frequency bands
- Proposed method outperforms state of the art and reduces training time and number of parameters
Multi-scale multi-band densenet
- Summarize DenseNet architecture
- Introduce up-scaling blocks and inter block skip connections to deal with high dimensional inputs and outputs
- Introduce multi-band DenseNet architecture to improve modeling efficiency and capability
- Outline complete architectures
Densenet
- Standard feed forward networks use non-linear transformations to compute output
- ResNet uses a skip connection to make training of deep architectures easier
- DenseNet replaces simple addition of output with concatenation of all preceding layers
- DenseNet uses BN, ReLU and convolution with k feature maps
- Down-sampling layer is used to capture global information efficiently
Multi-scale densenet with block skip connection and transposed convolution
- Dense blocks and down-sampling layers make up the downsampling path of the proposed multi-scale DenseNet.
- An upsampling layer is introduced to recover the original resolution from lower resolution feature maps.
- Inter-block skip connections are introduced to allow forward and backward signal flow without passing though lower resolution blocks.
- MDenseNet is a fully convolutional architecture that can be applied to arbitrary input length.
Multi-band mdensenet
- Convolution layer kernels are shared across the entire input field.
- Limited kernel sharing is more suitable for efficiently capturing local patterns in audio.
- Split input into multiple bands and apply multiscale DenseNet to each band.
- Concatenate output from full band MDenseNet with outputs from multiple sub-band MDenseNets.
- Full band MDenseNet can focus on modeling rough global structure.
Architecture details
- Proposed network architectures for audio source separation are described in Table 1
- MMDenseNet allows for individual design of each band and assigning of computational resources according to importance of each band
Experiments
Setup
- Evaluated proposed method on DSD100 dataset
- Dataset consists of 50 songs each in Dev and Test sets
- Task is to separate songs into 4 source instruments or vocals and accompaniment
- Used spectrogram of mixture as input and trained network to estimate target spectrogram
- Compared method with other state-of-the-art approaches
- MDenseNet performed as good as BLSTM
- MMDenseNet significantly improved performance and outperformed all baselines
- MM-DenseNet+ further improved performances and showed best overall result
Architecture validation
- Proposed multi-sale dense block enables network to model signal on different scales.
- Validated if dense blocks contribute to recovering target spectrogram by computing map-wise l2norm of filter weights.
- Comparing l2-norms of up-sampling path and skip connection path showed that every dense block at different scale contributes reasonably.
Model efficiency
- Proposed architecture encourages feature reuse, leading to a compact and efficient model
- Number of parameters of proposed architectures is significantly less than baseline methods
- MDenseNet achieved comparable performance to state-of-the-art BLEND with only 1.5% of the parameters
- MM-DenseNet outperformed BLEND with only 3.6% of the parameters
- Training time is significantly less than for BLSTM and BLEND methods
Conclusion
- Extended DenseNet to tackle audio source separation
- Proposed architectures have dense blocks at multiple scales connected through down-sampling and up-sampling layers
- Proposed multi-band DenseNet to enable kernels in convolution layer to learn more effectively
- Outperformed state-of-the-art by a large margin while reducing model size and training time