
  • Sound matching algorithms use parametric audio synthesis to approximate a target waveform.
  • Deep neural networks have achieved good results in matching sustained harmonic tones.
  • Matching nonstationary and inharmonic targets (e.g. percussion) is more challenging.
  • Mean square error in the parametric domain (P-loss) is simple and fast, but doesn’t take into account the differing perceptual significance of each parameter.
  • Mean square error in the spectrotemporal domain (spectral loss) is perceptually motivated, but has more local minima and converges slowly.
  • Perceptual-Neural-Physical loss (PNP) is an optimal quadratic approximation of spectral loss, while being as fast as P-loss during training.

Paper Content


  • Sound matching is a task in computer science that involves retrieving the synthesizer parameter setting that best matches a target sound.
  • It has applications in music transcription, virtual reality, and audio engineering.
  • Deep neural networks (DNNs) have been used to formulate sound matching as a supervised learning problem.
  • The goal is to optimize the synaptic weights of a DNN so that its output approximates the parameter setting of the target sound.
  • The paradigm of differentiable digital signal processing (DDSP) has been used to address the perceptual shortcomings of parameter loss by measuring the error in the spectral domain.


Approximating spectral loss with Riemannian geometry

  • Assume synthesizer g and feature map Φ to be continuously differentiable
  • Denote by L_DDSP the “spectral loss” associated with (Φ, f_W, g)
  • Value of L_DDSP at parameter set θ is given by equation 1
  • Conduct first-order Taylor expansion of (Φ ∘ g) near θ
  • Differentiable map (Φ ∘ g) induces weak Riemannian metric M on open set U of parameters θ
  • Define perceptual-neural-physical loss (PNP) by linearizing (Φ ∘ g) at θ inside the spectral loss, which yields a quadratic form in the parameter error
  • Gradient of PNP loss at given training pair (x_n, θ_n) with respect to scalar weight W_i is given by equation 6
  • PNP is a perceptually motivated extension of P-loss
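
As a rough illustration of the construction above, the sketch below builds the metric M from the Jacobian of a toy stand-in for Φ ∘ g and compares the resulting quadratic form with the spectral loss for a small parameter error. The toy map (a decaying sinusoid followed by a power spectrum), the function names, and the factor 1/2 are assumptions made for this example; they are not the paper's drum model, JTFS, or exact equations.

```python
import numpy as np

def phi_g(theta):
    """Toy stand-in for Phi(g(theta)); NOT the paper's synthesizer or JTFS."""
    t = np.linspace(0.0, 1.0, 64)
    freq, decay = theta
    x = np.exp(-decay * t) * np.sin(2.0 * np.pi * freq * t)  # toy "synthesizer" g
    return np.abs(np.fft.rfft(x)) ** 2                       # toy "feature map" Phi

def jacobian(f, theta, eps=1e-6):
    """Central finite-difference Jacobian of f at theta, shape (P, J)."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        cols.append((f(theta + step) - f(theta - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def spectral_loss(theta_hat, theta):
    """Squared feature-space error (1/2 factor assumed by convention)."""
    diff = phi_g(theta_hat) - phi_g(theta)
    return 0.5 * diff @ diff

def pnp_loss(theta_hat, theta, M):
    """Quadratic form of the parameter error under the metric M(theta)."""
    d = theta_hat - theta
    return 0.5 * d @ M @ d

theta = np.array([5.0, 3.0])                    # "true" parameters
J = jacobian(phi_g, theta)
M = J.T @ J                                     # metric induced by Phi o g
theta_hat = theta + np.array([1e-3, -1e-3])     # a nearby estimate

print(spectral_loss(theta_hat, theta), pnp_loss(theta_hat, theta, M))
```

The two printed values should be close for small errors and drift apart as the estimate moves away from θ. Setting M to the identity matrix recovers an ordinary squared parameter error, which is the sense in which PNP extends P-loss.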

Damped least squares

  • Principal components of Jacobian are eigenvectors of M
  • Eigenvectors form an orthonormal basis of R^J
  • Damping term λI up-shifts all eigenvalues of M
  • L2 regularization with coefficient λ allows smooth transition between spectral and parameter loss regimes
  • λ can be scheduled or adaptively changed according to epoch validation loss
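
The effect of the damping term can be checked numerically. The sketch below uses a random, deliberately ill-conditioned Jacobian (an assumption for illustration only) to show that adding λI shifts every eigenvalue of M by λ, and that a very large λ makes the damped loss behave like a scaled parameter loss.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ill-conditioned Jacobian: columns differ by orders of magnitude.
J = rng.normal(size=(50, 3)) * np.array([1e3, 1.0, 1e-3])
M = J.T @ J

eigvals, eigvecs = np.linalg.eigh(M)   # M is symmetric positive semidefinite

def damped_metric(M, lam):
    """Return M + lam * I, which up-shifts all eigenvalues of M by lam."""
    return M + lam * np.eye(M.shape[0])

def damped_pnp(delta, M, lam):
    """Quadratic form of the parameter error delta under the damped metric."""
    return 0.5 * delta @ damped_metric(M, lam) @ delta

lam = 10.0
eigvals_damped = np.linalg.eigvalsh(damped_metric(M, lam))

delta = np.array([0.1, -0.2, 0.3])     # example parameter error
p_loss = 0.5 * delta @ delta
large_lam = 1e12                       # much larger than the top eigenvalue of M
ratio = damped_pnp(delta, M, large_lam) / large_lam

print(eigvals, eigvals_damped, p_loss, ratio)
```

With λ = 0 the expression reduces to the undamped quadratic form, so varying λ moves between the two regimes described above.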

Application to drum sound matching

Perceptual: joint time-frequency scattering (JTFS)

  • JTFS is a nonlinear convolutional operator which extracts spectrotemporal modulations in the constant-Q scalogram
  • After convolution, pointwise complex modulus and temporal averaging are applied to each JTFS coefficient
  • JTFS is reminiscent of spectrotemporal receptive fields and may serve as a biologically plausible predictor of neurophysiological responses
  • Euclidean distances in Φ space predict auditory judgments of timbre similarity
  • JTFS is computed with Q_1 = 12, Q_2 = 1, and Q_fr = 1 filters per octave, temporal averaging of T = 3 seconds and frequential averaging of F = 2 octaves, resulting in P = 20762 paths

Neural: deep convolutional network (convnet)

  • EfficientNet is a convolutional neural network architecture that balances the scaling of the depth, width and input resolution of consecutive convolutional blocks.
  • It achieves state-of-the-art performance on image classification with significantly fewer trainable parameters.
  • It has also been successful on audio classification benchmarks.
  • We adopt EfficientNet-B0 as our encoder, resulting in 4M learnable parameters.
  • We append a linear dense layer of neurons and a 1D batch normalization before tanh activation.
  • The input to the encoder is the log-scaled CQT coefficients of each example.

Physical: functional transformation method (FTM)

  • Interested in perpendicular displacement X(t, u) on a rectangular drum face
  • Solved from partial differential equation defined in Cartesian coordinate system
  • Beyond the standard traveling wave equation, fourth-order spatial and first-order time derivatives incorporate damping factors
  • Rectangular drum model capable of eliciting representative percussive sounds
  • Fix the displacement at zero on all four sides of the rectangular drum at all times
  • Assume excitation function to be separable and localized in space and time
  • Implemented generator g as a PDE solver for the high-order damped wave equation
  • Reparametrized PDE parameters into θ
  • Prescribed sonically-plausible ranges for each parameter in θ



  • Trained fW with 3 different losses
  • Batch size of 64 samples for spectral loss, 256 samples for parameter and PNP loss
  • Training proceeds for 70 epochs
  • Adam optimizer with learning rate 10^-3
  • Training time per epoch on a single Tesla V100 16GB GPU reported in Table 1

Evaluation with JTFS-based spectral loss

  • Use L2 norm of JTFS coefficients error for evaluation
  • Include average multi-scale spectral error for comparison
  • Euclidean JTFS distance includes spectrotemporal modulations
  • Both metrics measure perceptual closeness, not parametric retrieval accuracy
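
Both evaluation metrics are distances between signal representations and not between parameter vectors. The sketch below shows one common way to compute a multi-scale spectral error (mean absolute difference of STFT magnitudes over several window sizes) and a plain L2 distance between feature vectors. The window sizes, hop, and averaging are assumptions; this summary does not state the exact definition used in the paper, and computing JTFS itself is left out.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def multiscale_spectral_distance(x, y, fft_sizes=(64, 128, 256, 512)):
    """Average over scales of the mean absolute STFT-magnitude difference."""
    total = 0.0
    for n_fft in fft_sizes:
        sx = stft_magnitude(x, n_fft, n_fft // 4)
        sy = stft_magnitude(y, n_fft, n_fft // 4)
        total += np.mean(np.abs(sx - sy))
    return total / len(fft_sizes)

def feature_l2_distance(phi_x, phi_y):
    """Euclidean distance between two precomputed feature vectors."""
    return np.linalg.norm(np.asarray(phi_x) - np.asarray(phi_y))

t = np.arange(4096) / 8000.0
a = np.sin(2.0 * np.pi * 220.0 * t)
b = np.sin(2.0 * np.pi * 330.0 * t)
print(multiscale_spectral_distance(a, b))
```

A reconstruction can score well on these metrics even when its parameters differ from the ground truth, which is why they are read as measures of perceptual closeness.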


  • The bare form of PNP loss is nontrivial to apply
  • Numerical precision errors and extreme deformation of the optimization landscape may lead to numerical instability
  • Training with PNP loss without damping (λ = 0) leads to convergence issues
  • Eigenvalues of the matrices M range from 0 to 10^20
  • Damping mechanisms used to update λ include constant λ, scheduled λ decay, and adaptive λ decay
  • Adaptive λ decay gives the best-performing model, with λ initialized to 10^20 and decayed to 3 × 10^14 in 20 epochs
  • PNP loss is able to suppress more errors than parameter loss in samples with high M(θ)_{j,j}
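
One possible shape for an adaptive damping rule, in the spirit of Levenberg-Marquardt, is sketched below: shrink λ when the epoch validation loss improves and grow it otherwise. The shrink and growth factors, the cap, and the validation losses are all hypothetical; the summary only states that λ is adapted according to validation loss, not how.

```python
def update_damping(lam, val_loss, prev_val_loss,
                   shrink=0.5, grow=2.0, lam_max=1e20):
    """Return the next damping value given the latest validation losses."""
    if val_loss < prev_val_loss:
        return lam * shrink              # progress: lean more on the spectral regime
    return min(lam * grow, lam_max)      # setback: move back toward parameter loss

lam = 1e20                               # initial damping, as reported above
history = [lam]
val_losses = [1.0, 0.8, 0.7, 0.75, 0.6]  # made-up validation losses per epoch
for prev, cur in zip(val_losses[:-1], val_losses[1:]):
    lam = update_damping(lam, cur, prev)
    history.append(lam)

print(history)
```

A scheduled decay would replace the comparison with a fixed per-epoch factor, and a constant λ would skip the update entirely.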


  • PNP autoencoding is a bilinear form learning objective for sound matching tasks
  • PNP optimizes the retrieval of physical parameters from sounds in a perceptually-motivated metric space
  • PNP is mathematically similar to spectral loss and can transition between optimizing in parameter and spectral loss regimes
  • Damping mechanisms are used to facilitate learning under ill-conditioned empirical settings
  • Six models are trained, varying along two factors: whether pitch is retrieved and the choice of loss function
  • The best-performing models are those trained with P-loss and PNP loss