Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Sound matching algorithms use parametric audio synthesis to approximate a target waveform.
Deep neural networks have achieved good results in matching sustained harmonic tones.
Matching nonstationary and inharmonic targets (e.g. percussion) is more challenging.
Mean square error in the parametric domain (P-loss) is simple and fast, but doesn’t take into account the differing perceptual significance of each parameter.
Mean square error in the spectrotemporal domain (spectral loss) is perceptually motivated, but has more local minima and converges slowly.
Perceptual-Neural-Physical loss (PNP) is an optimal quadratic approximation of spectral loss, while being as fast as P-loss during training.

Paper Content

Introduction

Sound matching is a task in computer science that involves retrieving the parameter setting of a sound to match a target sound.
It has applications in music transcription, virtual reality, and audio engineering.
Deep neural networks (DNNs) have been used to formulate sound matching as a supervised learning problem.
The goal is to optimize the synaptic weights of a DNN so that it approximates the parameter setting of the target sound.
The paradigm of differentiable digital signal processing (DDSP) has been used to address this issue.

Methods

Approximating spectral loss with Riemannian geometry

Assume synthesizer g and feature map Φ to be continuously differentiable
Denote by L_DDSP the “spectral loss” associated to (Φ, f_W, g)
Value of L_DDSP at parameter set θ is given by equation 1
Conduct first-order Taylor expansion of (Φ ∘ g) near θ
Differentiable map (Φ ∘ g) induces weak Riemannian metric M onto open set U of parameters θ
Define perceptual-neural-physical loss (PNP) as linearization of spectral loss at θ
Gradient of PNP loss at given training pair (x_n, θ_n) with respect to scalar weight W_i is given by equation 6
PNP is perceptually motivated extension of P-loss
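
The construction above can be illustrated numerically. The sketch below is not the paper's implementation: `synth` (a decaying sinusoid) and `features` (a few DFT magnitudes) are hypothetical stand-ins for the drum model g and the feature map Φ, the Jacobian is taken by finite differences rather than automatic differentiation, and the factor 1/2 in both losses is an assumption. It builds M(θ) = JᵀJ from the Jacobian J of (Φ ∘ g) and checks that the quadratic form in parameter space tracks the spectral loss for a small parameter error.

```python
import math

def synth(theta, n=64):
    """Toy generator g: a decaying sinusoid (stand-in for the drum model)."""
    freq, decay = theta
    return [math.exp(-decay * t / n) * math.sin(2 * math.pi * freq * t / n)
            for t in range(n)]

def features(x, n_bins=8):
    """Toy feature map Phi: magnitudes of the first DFT bins (stand-in for JTFS)."""
    n = len(x)
    out = []
    for k in range(n_bins):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        out.append(math.hypot(re, im))
    return out

def phi_g(theta):
    return features(synth(theta))

def jacobian(theta, eps=1e-5):
    """Central finite-difference Jacobian of Phi∘g, one row per feature."""
    cols = []
    for j in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[j] += eps
        dn[j] -= eps
        cols.append([(a - b) / (2 * eps) for a, b in zip(phi_g(up), phi_g(dn))])
    return [list(row) for row in zip(*cols)]

def metric(theta):
    """M(theta) = J^T J, a symmetric positive semidefinite matrix."""
    J = jacobian(theta)
    n = len(theta)
    return [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]

def spectral_loss(theta_hat, theta):
    """0.5 * ||Phi(g(theta_hat)) - Phi(g(theta))||^2, computed in feature space."""
    return 0.5 * sum((u - v) ** 2 for u, v in zip(phi_g(theta_hat), phi_g(theta)))

def pnp_loss(theta_hat, theta, M):
    """0.5 * (theta_hat - theta)^T M (theta_hat - theta), computed in parameter space."""
    d = [u - v for u, v in zip(theta_hat, theta)]
    n = len(d)
    return 0.5 * sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))

theta = [3.3, 2.0]          # ground-truth parameters (frequency, decay)
theta_hat = [3.301, 2.002]  # a slightly wrong prediction
M = metric(theta)           # depends only on theta, so it can be precomputed
spec = spectral_loss(theta_hat, theta)
pnp = pnp_loss(theta_hat, theta, M)
print(spec, pnp)
```

Because M depends only on the ground-truth θ, it can be computed once per training example ahead of time. This is consistent with the claim that PNP trains as fast as P-loss: no synthesis or feature extraction is needed inside the training loop.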

Damped least squares

Principal components of Jacobian are eigenvectors of M
Eigenvectors form an orthonormal basis of R^J
Damping term λI up-shifts all eigenvalues of M
L2 regularization with coefficient λ allows smooth transition between spectral and parameter loss regimes
λ can be scheduled or adaptively changed according to epoch validation loss
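
A small worked example may help make the damping argument concrete. The sketch below uses a hand-picked 2×2 rank-deficient matrix as a toy M. It shows that adding λI shifts every eigenvalue up by λ, and that the damped quadratic form equals the PNP term plus λ times a parameter-loss term. The factor 1/2 and the `update_lambda` rule are assumptions for illustration; the source only states that λ can be scheduled or adapted from the epoch validation loss.

```python
import math

def eig_sym_2x2(M):
    """Eigenvalues of a symmetric 2x2 matrix in ascending order (closed form)."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    mean = 0.5 * (a + d)
    rad = math.hypot(0.5 * (a - d), b)
    return mean - rad, mean + rad

def damped(M, lam):
    """Return M + lam * I."""
    n = len(M)
    return [[M[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]

def quad(M, d):
    """Quadratic form 0.5 * d^T M d."""
    n = len(d)
    return 0.5 * sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))

def update_lambda(lam, val_loss, best_val_loss, factor=0.5):
    """Hypothetical adaptive rule: shrink lam only when validation loss improves."""
    return lam * factor if val_loss < best_val_loss else lam

M = [[4.0, 2.0], [2.0, 1.0]]   # toy metric with eigenvalues 0 and 5
d = [0.1, -0.2]                # parameter error lying in the null space of M
lam = 0.5

lo, hi = eig_sym_2x2(M)
lo_d, hi_d = eig_sym_2x2(damped(M, lam))

pnp = quad(M, d)                               # undamped PNP term
p_loss = 0.5 * sum(v * v for v in d)           # plain parameter loss
damped_loss = quad(damped(M, lam), d)          # equals pnp + lam * p_loss
print(lo, hi, lo_d, hi_d, pnp, damped_loss)
```

The error vector is chosen so that the undamped form assigns it almost no loss, while the damped form still penalizes it. A large λ makes the objective behave like P-loss, and a small λ recovers the spectral-loss regime.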

Application to drum sound matching

Perceptual: joint time-frequency scattering (JTFS)

JTFS is a nonlinear convolutional operator which extracts spectrotemporal modulations in the constant-Q scalogram
After convolution, pointwise complex modulus and temporal averaging are applied to each JTFS coefficient
JTFS is reminiscent of spectrotemporal receptive fields and may serve as a biologically plausible predictor of neurophysiological responses
Euclidean distances in Φ space predict auditory judgments of timbre similarity
JTFS is computed with Q_1 = 12, Q_2 = 1, and Q_fr = 1 filters per octave, temporal averaging of T = 3 seconds and frequential averaging of F = 2 octaves, resulting in P = 20762 paths

Neural: deep convolutional network (convnet)

EfficientNet is a convolutional neural network architecture that balances the scaling of the depth, width and input resolution of consecutive convolutional blocks.
It achieves state-of-the-art performance on image classification with significantly fewer trainable parameters.
It is also successful in benchmarking audio classification tasks.
We adopt EfficientNet-B0 as our encoder, resulting in 4M learnable parameters.
We append a linear dense layer of neurons and a 1D batch normalization before tanh activation.
The input to the encoder is the log-scaled CQT coefficients of each example.
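
The output head described above (dense layer, then 1D batch normalization, then tanh) can be sketched in plain Python. This is an illustration rather than the authors' code: the embedding size, the weights, and the omission of the learned batch-norm scale and shift are all assumptions, and a real model would use a deep learning framework on top of the EfficientNet-B0 embedding.

```python
import math

def dense(x, weights, bias):
    """Linear layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def batch_norm(batch, eps=1e-5):
    """Normalize each unit to zero mean and unit variance across the batch.
    Learned scale and shift are omitted for brevity (assumption)."""
    n, dims = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
            for row in batch]

def head(embeddings, weights, bias):
    """Dense -> batch norm -> tanh, so every predicted parameter lies in (-1, 1)."""
    z = [dense(e, weights, bias) for e in embeddings]
    z = batch_norm(z)
    return [[math.tanh(v) for v in row] for row in z]

# Hypothetical batch of 3 encoder embeddings of size 4, mapped to 2 parameters.
embeddings = [[0.2, -1.0, 0.5, 0.3], [1.5, 0.1, -0.4, 0.0], [-0.7, 0.8, 0.9, -1.2]]
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
bias = [0.0, 0.1]
predictions = head(embeddings, weights, bias)
print(predictions)
```

The tanh keeps each predicted parameter in a bounded interval, which fits a setup where every synthesizer parameter has a prescribed range.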

Physical: functional transformation method (FTM)

Interested in perpendicular displacement X(t, u) on a rectangular drum face
Solved from partial differential equation defined in Cartesian coordinate system
Beyond the standard traveling wave equation, fourth-order spatial and first-order time derivatives incorporate damping factors
Rectangular drum model capable of eliciting representative percussive sounds
Bound four sides of rectangular drum at zero at all times
Assume excitation function to be separable and localized in space and time
Implemented generator g as PDE solver to high-order damped wave equation
Reparametrized PDE parameters into θ
Prescribed sonically-plausible ranges for each parameter in θ
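
To give a feel for what such a generator computes, here is a much simplified modal sketch. It is not the paper's FTM solver: it uses the textbook ideal rectangular membrane (no stiffness term), and the decay rule, mode count, and all parameter names and values are assumptions made for illustration. It does keep two ingredients from the list above: mode shapes that vanish on all four edges, and an excitation localized at a single strike position.

```python
import math

def mode_shape(m, n, u1, u2, l1=1.0, l2=1.0):
    """Spatial mode sin(m*pi*u1/l1) * sin(n*pi*u2/l2); zero on all four edges."""
    return math.sin(m * math.pi * u1 / l1) * math.sin(n * math.pi * u2 / l2)

def mode_freq(m, n, f0=100.0, l1=1.0, l2=1.0):
    """Ideal-membrane mode frequency, scaled so that mode (1, 1) sits at f0 Hz."""
    return f0 * math.hypot(m / l1, n / l2) / math.hypot(1 / l1, 1 / l2)

def drum(pos=(0.4, 0.4), strike=(0.3, 0.3), decay=6.0, modes=5, sr=8000, dur=0.25):
    """Displacement at `pos` after an impulse at `strike`, as a sum of decaying modes."""
    n_samples = int(sr * dur)
    out = [0.0] * n_samples
    for m in range(1, modes + 1):
        for n in range(1, modes + 1):
            gain = mode_shape(m, n, *strike) * mode_shape(m, n, *pos)
            freq = mode_freq(m, n)
            rate = decay * (m * m + n * n) / 2.0  # assumption: higher modes decay faster
            for t in range(n_samples):
                s = t / sr
                out[t] += gain * math.exp(-rate * s) * math.sin(2 * math.pi * freq * s)
    return out

signal = drum()
print(len(signal), max(abs(v) for v in signal))
```

Here the arguments of `drum` play the role of θ: each one would be given a plausible range, and the network predicts a point inside those ranges.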

Results

Baselines

Trained f_W with 3 different losses
Batch size of 64 samples for spectral loss, 256 samples for parameter and PNP loss
Training proceeds for 70 epochs
Adam optimizer with learning rate 10^-3
Training time per epoch on a single Tesla V100 16GB GPU reported in Table 1

Evaluation with JTFS-based spectral loss

Use L2 norm of JTFS coefficients error for evaluation
Include average multi-scale spectral error for comparison
Euclidean JTFS distance includes spectrotemporal modulations
Both metrics measure perceptual closeness, not parametric retrieval accuracy

Discussion

The bare form of PNP loss is nontrivial to apply
Numerical precision errors and extreme deformation of the optimization landscape may lead to numerical instability
Training PNP loss without damping (λ = 0) leads to convergence issues
Eigenvalues of the matrices M range from 0 to 10^20
Damping mechanisms used to update λ include constant λ, scheduled λ decay, and adaptive λ decay
Adaptive λ decay gives the best performing model, with λ initialized to 10^20 and decayed to 3×10^14 in 20 epochs
PNP loss is able to suppress more errors in samples with high M(θ)_{j,j} than parameter loss
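
The effect of damping on conditioning can be checked with simple arithmetic. The sketch below reads the reported values as an eigenvalue range of 0 to 10^20 and a damping that goes from 10^20 to 3×10^14 over 20 epochs. The average per-epoch factor is only an illustration: it assumes a geometric decay, whereas the reported schedule is adaptive.

```python
def condition_number(eig_min, eig_max, lam):
    """Condition number of M + lam * I, given the extreme eigenvalues of M."""
    return (eig_max + lam) / (eig_min + lam)

eig_min, eig_max = 0.0, 1e20   # eigenvalue range of M as read from the text

# Without damping (lam = 0) the ratio is undefined because eig_min is 0.
start = condition_number(eig_min, eig_max, 1e20)   # damping at initialization
end = condition_number(eig_min, eig_max, 3e14)     # damping after 20 epochs

# Average per-epoch shrink factor if the decay were geometric (assumption).
avg_factor = (3e14 / 1e20) ** (1 / 20)
print(start, end, avg_factor)
```

At initialization the damped metric is close to a scaled identity, so training starts near the parameter-loss regime. As λ shrinks, the metric becomes more anisotropic and the objective moves toward the spectral-loss regime.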

Conclusion

PNP autoencoding is a bilinear form learning objective for sound matching tasks
PNP optimizes the retrieval of physical parameters from sounds in a perceptually-motivated metric space
PNP is mathematically similar to spectral loss and can transition between optimizing in parameter and spectral loss regimes
Damping mechanisms are used to facilitate learning under ill-conditioned empirical settings
Six models are trained with two modalities: pitch retrieval and choice of loss function
Best performing models are P-loss and PNP loss
