Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Sound matching algorithms use parametric audio synthesis to approximate a target waveform.
- Deep neural networks have achieved good results in matching sustained harmonic tones.
- Matching nonstationary and inharmonic targets (e.g. percussion) is more challenging.
- Mean square error in the parametric domain (P-loss) is simple and fast, but doesn’t take into account the differing perceptual significance of each parameter.
- Mean square error in the spectrotemporal domain (spectral loss) is perceptually motivated, but has more local minima and a slow convergence.
- Perceptual-Neural-Physical loss (PNP) is an optimal quadratic approximation of spectral loss, while being as fast as P-loss during training.
Paper Content
Introduction
- Sound matching is a task in computer science that involves retrieving the parameter setting of a sound to match a target sound.
- It has applications in music transcription, virtual reality, and audio engineering.
- Deep neural networks (DNN’s) have been used to formulate sound matching as a supervised learning problem.
- The goal is to optimize the synaptic weights of a DNN so that it approximates the parameter setting of the target sound.
- The paradigm of differentiable digital signal processing (DDSP) has been used to address this issue.
Methods
Approximating spectral loss with riemannian geometry
- Assume synthesizer g and feature map Φ to be continuously differentiable
- Denote by L DDSP the “spectral loss” associated to (Φ, f W , g)
- Value of L DDSP at parameter set θ is given by equation 1
- Conduct first-order Taylor expansion of (Φ • g) near θ
- Differentiable map (Φ • g) induces weak Riemannian metric M onto open set U of parameters θ
- Define perceptual-neural-physical loss (PNP) as linearization of spectral loss at θ
- Gradient of PNP loss at given training pair (xn, θn) with respect to scalar weight Wi is given by equation 6
- PNP is perceptually motivated extension of P-loss
Damped least squares
- Principal components of Jacobian are eigenvectors of M
- Eigenvectors form an orthonormal basis of R J
- Damping term λI up-shifts all eigenvalues of M
- L2 regularization with coefficient λ allows smooth transition between spectral and parameter loss regimes
- λ can be scheduled or adaptively changed according to epoch validation loss
Application to drum sound matching
Perceptual: joint time-frequency scattering (jtfs)
- JTFS is a nonlinear convolutional operator which extracts spectrotemporal modulations in the constant-Q scalogram
- After convolution, pointwise complex modulus and temporal averaging is applied to each JTFS coefficient
- JTFS is reminiscent of spectrotemporal receptive fields and may serve as a biologically plausible predictor of neurophysiological responses
- Euclidean distances in Φ space predict auditory judgments of timbre similarity
- JTFS is computed with Q1 = 12, Q2 = 1, and Q fr = 1 filters per octave, temporal averaging of T = 3 seconds and frequential averaging of F = 2 octaves, resulting in P = 20762 paths
Neural: deep convolutional network (convnet)
- EfficientNet is a convolutional neural network architecture that balances the scaling of the depth, width and input resolution of consecutive convolutional blocks.
- It achieves state-of-the-art performance on image classification with significantly less trainable parameters.
- It is also successful in benchmarking audio classification tasks.
- We adopt EfficientNet-B0 as our encoder, resulting in 4M learnable parameters.
- We append a linear dense layer of neurons and a 1D batch normalization before tanh activation.
- The input to the encoder is the log-scaled CQT coefficients of each example.
Physical: functional transformation method (ftm)
- Interested in perpendicular displacement X(t, u) on a rectangular drum face
- Solved from partial differential equation defined in Cartesian coordinate system
- Standard traveling equation, fourth-order spatial and first-order time derivatives incorporate damping factors
- Rectangular drum model capable of eliciting representative percussive sounds
- Bound four sides of rectangular drum at zero at all time
- Assume excitation function to be separable and localized in space and time
- Implemented generator g as PDE solver to high-order damped wave equation
- Reparametrized PDE parameters into θ
- Prescribed sonically-plausible ranges for each parameter in θ
Results
Baselines
- Trained fW with 3 different losses
- Batch size of 64 samples for spectral loss, 256 samples for parameter and PNP loss
- Training proceeds for 70 epochs
- Adam optimizer with learning rate 10-3
- Training time per epoch on a single Tesla V100 16GB GPU reported in Table 1
Evaluation with jtfs-based spectral loss
- Use L2 norm of JTFS coefficients error for evaluation
- Include average multi-scale spectral error for comparison
- Euclidean JTFS distance includes spectrotemporal modulations
- Both metrics measure perceptual closeness, not parametric retrieval accuracy
Discussion
- Bear PNP loss form is nontrivial to apply
- Numerical precision errors and extreme deformation of the optimization landscape may lead to numerical instability
- Training PNP loss without damping λ = 0 leads to convergence issues
- Eigenvalues of Ms range from 0 to 1020
- Damping mechanisms used to update λ include constant λ, scheduled λ decay, and adaptive λ decay
- Adaptive λ decay best performing model, λ initialized to 1020 and decayed to 3x1014 in 20 epochs
- PNP loss able to suppress more errors in samples with high M(θ)j,j than parameter loss
Conclusion
- PNP autoencoding is a bilinear form learning objective for sound matching tasks
- PNP optimizes the retrieval of physical parameters from sounds in a perceptually-motivated metric space
- PNP is mathematically similar to spectral loss and can transition between optimizing in parameter and spectral loss regimes
- Damping mechanisms are used to facilitate learning under ill-conditioned empirical settings
- Six models are trained with two modalities: pitch retrieval and choice of loss function
- Best performing models are P-loss and PNP loss