Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Converting time domain waveforms to frequency domain spectrograms is typically done before model training.
This approach has drawbacks, such as taking up a lot of hard disk space and needing to process audio clips again if another dataset is used.
This paper proposes a neural network based toolbox, nnAudio, which leverages 1D convolutional neural networks to perform time domain to frequency domain conversion during feed-forward.
nnAudio reduces the waveforms-to-spectrograms conversion time and further optimizes the existing CQT algorithm.

Paper Content

I. introduction

Spectrograms have been used as input for neural network models since the 1980s
Different types of spectrograms are tailored to different applications
Despite advances in end-to-end learning, many recent publications still use spectrograms as input
Applications include speech recognition, emotion detection, translation, enhancement, separation, conversion, tagging, cover detection, melody extraction, and transcription
Training an end-to-end model on raw audio data takes longer and yields slightly better performance
Finding the best spectrogram input requires trial and error and a lot of hard disk space
Existing methods for time-frequency conversion are too slow for on-the-fly spectrogram extraction
Tensorflow, Keras, and PyTorch have GPU-based spectrogram extraction methods
None of the existing libraries support constant-Q transform
nnAudio is a fast, differentiable, and trainable neural network-based audio processing framework
nnAudio is built using PyTorch and includes extended functionalities such as calculating Mel spectrograms and constant-Q transforms

A. summary of key advantages

Developed GPU-based audio processing framework
Leverages power of neural networks
End-to-end neural network training with on-the-fly time-frequency conversion layer
Significantly faster processing speed than traditional audio processing
Trainable Fourier, Mel and CQT kernels

Ii. signal processing: a quick overview

Transformation methods used to convert signals from time domain to frequency domain
Readers with signal processing background can skip this section

A. discrete fourier transform (dft)

Audio signals are converted to digital signals before being stored.
The Discrete Fourier Transform is used to convert the digital signal from the time domain to the frequency domain.
The frequency domain output is in reverse order and only the first half of the frequency bins are extracted.
The DFT is a split-sum of real and complex components.
The frequency k in the DFT is given in terms of normalized frequency.

B. dft for arbitrary frequency ranges

DFT is capable of resolving a finite number of distinct frequencies
Frequency resolution can be improved by increasing window size
Compromise between time and frequency resolution
Applying DFT followed by inverse-DFT results in perfect reconstruction of original signal
Modifying DFT to change frequencies of basis vectors
Using log-frequency spectrogram to focus resolution in needed frequency range
DFT and variable-resolution DFT used to calculate STFT using convolutional neural network

Iii. neural network-based framework

Calculate STFT, Mel spectrogram, and CQT using 1D convolutional neural network
Implement as library (nnAudio) in PyTorch 2
Assumes basic understanding of convolutional neural networks
Encodes audio processing knowledge into neurons of neural network
STFT is fundamental operation for Mel spectrogram and CQT
Multiply spectrogram by Mel filter bank kernel to convert to Mel spectrogram
Compute CQT by beginning with STFT, followed by multiplication with CQT kernel

A. short-time fourier transform (stft)

Short-time Fourier Transform (STFT) is an application of the DFT that cuts the signal into short windows before performing the transform
Standard way to apply the DFT for audio analysis applications
Cooley-tukey Fast Fourier Transform algorithm (FFT) computes the DFT in O(N log N ) operations
O(N 2 ) DFT often outperforms the O(N log N ) FFT for small values of N when the underlying platform supports fast vector multiplication
Neural network libraries typically include fast GPU-optimised convolution functions, which can be used to compute the canonical DFT quickly
Discrete linear convolution of a kernel h with a signal x is defined as follows
PyTorch defines a convolution function with a stride argument
Can use convolution with stride to make fast GPU-based implementations of the STFT
DFT is usually computed with a function that fades the samples at the edges of each window smoothly down to near zero
Typical DFT window functions include Hann, Hamming, and Blackman types
Can implement window smoothing efficiently by multiplying window function elementwise with the filter kernels before doing the convolution

B. mel spectrogram

Mel frequency scale was proposed in 1937 to quantify pitch
Several attempts to revise the Mel scale have been made
There is not one “right” formula for the Mel scale
Traditional frequency to Mel scale conversion is mentioned in O’Shaughnessy’s book
Another form of the Mel scale is used in the Auditory Toolbox for MATLAB and librosa
Mel filter banks are multiplied to each timestep of the STFT result to obtain a Mel spectrogram
PyTorch 1D convolutional neural network is used to obtain the STFT results
Mel filter banks are used to initialize the weights of a single-layer fully-connected neural network

C. constant-q transform 1) a quick overview of the constant-q transform (1992 version)

Musical pitches have a logarithmic relationship
Spectrograms need to use a logarithmic frequency scale
Naive solution is to modify frequencies of DFT basis functions
Orthogonality of basis is lost, leading to complicated energy relationship
Gaps between frequency bins at upper end of spectrum
Overlap between bins at lower end of spectrum
Constant-Q transform (CQT) proposed by Brown in 1991
CQT uses a logarithmic frequency scale
CQT maintains constant frequency resolution by keeping constant Q value
CQT and logarithmic frequency STFT have subtle differences
CQT can be implemented with 1D convolutional neural network
CQT kernels can be transformed to frequency domain
CQT kernels can be huge if low frequency is used
nnAudio API provides CQT function

3) downsampling

Neural networks can be used to downsample input audio clips by a factor of two without aliasing.
A low pass filter is required to filter out any frequencies above the downsampled Nyquist frequency.
Finite impulse response filtering (FIR) is used to implement the low pass filter.
The Window Method is used to design the low-pass FIR filter kernel.
The CQT kernels are kept the same while the input audio is being downsampled recursively.

5) cqt with time domain kernels

Brown and Puckette proposed an algorithm in 1992 that was more efficient in terms of memory
Converting time domain CQT kernels to frequency domain kernels makes them more memory efficient
With modern technology, memory is no longer an issue, so it is no longer necessary to convert the time domain CQT kernels to frequency domain kernels
This modification improves the CQT computation speed

Iv. experimental results

Analyzed speed and accuracy of proposed framework nnAudio
Compared nnAudio to existing audio processing library librosa
Compared STFT, Mel Spectrogram, and CQT implementations
Compared to other existing GPU audio processing libraries
Tested speed to process 1,770 audio files
Tested correctness of resulting spectrograms
Discussed CQT1992v2 and CQT2010v2 implementations

A. speed

1) setup

Used MAPS dataset to benchmark nnAudio
Discarded first 20,000 samples from each audio excerpt to remove silence
Each audio excerpt kept same length (80,000 samples)
Goal of speed test to convert array of waveforms to array of spectrograms
Tested on 3 different machines
Compared speed of nnAudio to librosa, Essentia, Kapre, Tensorflow, and Torchaudio
Librosa with multi-process not used
Caching disabled for librosa
Only one GPU used for nnAudio
nnAudio faster than librosa regardless of machine
RTX 2080 Ti GPU performance similar to Tesla v100 GPU
Extra time required to transfer kernels from RAM to GPU memory
Initialization time influenced by kernel sizes of STFT, MelSpec, and CQT
CQT2010 faster initialization time than CQT1992
STFT, MelSpec, and CQT speeds shown in Figure 11
CQT1992v2 and CQT2010v2 improve computation speed
CQT2010v2 used for computers without GPU
CQT1992v2 used when GPU available
nnAudio faster than torchaudio and Kapre
NVIDIA DALI not included as it does not calculate gradient for spectrograms

B. conversion output 1) setup

Librosa is used as a benchmark to check the correctness of the implementation
Four input signals are used to determine the model output correctness
The chromatic piano scale is recorded with a piano instrument
CQT1992v2 is faster and better quality than CQT2010 and CQT2010v2

2) results

Results of accuracy test shown in Figures 12 and 13
STFT and MelSpec results from librosa and nnAudio are very similar
CQT output for nnAudio is smoother
Comparison between CQT1992v2, CQT2010v2 and librosa’s CQT shown in Figure 14
Aliasing due to downsampling is observable
CQT1992v2 is fastest and best of all proposed GPU-based CQT implementations

V. example applications

nnAudio can be used to explore different spectrograms as input for music transcription models.
nnAudio allows STFT kernels to be trained/finetuned for better spectrograms.

A. exploring different input representations

Music transcription can be done quickly with nnAudio
nnAudio can explore different types of spectrograms as input for a neural network
Four types of spectrograms are explored: linear frequency scale, logarithmic frequency scale, Mel spectrogram, and CQT
Each spectrogram has different parameter settings to consider
34 different input representations are explored
Transcription accuracy can be measured with mir_eval.multipitch.metrics()
Input representation and its settings have a big influence on model performance
nnAudio enables fast comparison of different representations
nnAudio can be faster than pre-processed approach
Time required for experimental phase can be drastically reduced with nnAudio

Index

nnAudio allows us to integrate audio processing into a neural network model
nnAudio allows us to store audio clips in the original waveform, without saving extra copies of the dataset for the spectrograms
nnAudio is useful when the dataset is large and takes tens of hours to convert the data from waveforms to spectrograms

B. trainable transformation kernels

STFT and MelSpec can be implemented with a 1D convolutional neural network
Gradient descent can be used to finetune the kernels and filter banks
STFT window size is set to 64 to make the task non-trivial
10,925 pure sine waves with different frequencies are generated to form the dataset
80% of the sine waves are used as the training set, 20% as the test set
Two models are used to predict the frequency of the input signal: a fully connected network and a 2D convolutional neural network
Trainable transformation layer results in a lower mean square error
Trained Fourier kernels contain higher frequencies on top of the fundamental frequency
Trainable Mel filter banks and CQT kernels allow a richer spectrogram to be obtained
Higher-performing end-to-end model can be obtained with a trainable transformation layer

Vi. conclusion

Presented a new framework for extracting spectrograms with neural networks
Allows for dynamic training of kernels
Implemented as GPU-based library, nnAudio
Time domain to frequency domain transformation algorithms implemented in PyTorch
GPU audio processing reduces time to convert waveforms to spectrograms
Trainable kernels result in better model on frequency prediction task
Proposed framework is first to support CQT
CQT algorithm using time domain kernels performs faster than frequency domain kernels
Computation time reduced drastically with nnAudio on real dataset, MusicNet

Link to paper#

Abstract#

Paper Content#

I. introduction#

A. summary of key advantages#

Ii. signal processing: a quick overview#

A. discrete fourier transform (dft)#

B. dft for arbitrary frequency ranges#

Iii. neural network-based framework#

A. short-time fourier transform (stft)#

B. mel spectrogram#

C. constant-q transform 1) a quick overview of the constant-q transform (1992 version)#

3) downsampling#

5) cqt with time domain kernels#

Iv. experimental results#

A. speed#

1) setup#

B. conversion output 1) setup#

2) results#

V. example applications#

A. exploring different input representations#

Index#

B. trainable transformation kernels#

Vi. conclusion#

Link to paper

Abstract

Paper Content

I. introduction

A. summary of key advantages

Ii. signal processing: a quick overview

A. discrete fourier transform (dft)

B. dft for arbitrary frequency ranges

Iii. neural network-based framework

A. short-time fourier transform (stft)

B. mel spectrogram

C. constant-q transform 1) a quick overview of the constant-q transform (1992 version)

3) downsampling

5) cqt with time domain kernels

Iv. experimental results

A. speed

1) setup

B. conversion output 1) setup

2) results

V. example applications

A. exploring different input representations

Index

B. trainable transformation kernels

Vi. conclusion