Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automatic Music Transcription (AMT) is a key technology with many applications.
- Instrument-specific systems tend to be more accurate than instrument-agnostic methods.
- Estimating frame-wise $f_0$ values is easier than note event detection.
- This paper proposes a lightweight neural network for musical instrument transcription.
- The model is trained to predict onsets, multipitch and note activations.
- The system is substantially better than a comparable baseline.
Paper Content
Introduction
- Automatic transcription of music has been studied for more than four decades
- Systems have improved since the rise of deep learning, but the task remains unsolved
- AMT systems are usually designed with a limited scope, focusing on a sub-task
- Sub-tasks branch along three dimensions: output polyphony, output type, and input audio
- Specializing for a specific instrument class can increase accuracy
- Deploying a number of specialized systems can be intractable
- This paper considers an instrument-agnostic polyphonic AMT model
- Model is lightweight and runs efficiently on low-end devices
- Model is jointly trained to predict frame-level onset, multipitch and note posteriorgrams
- Model is evaluated against a recent baseline model
- Code and trained models are made publicly available
Background and related work
- AMT systems have three dimensions: degree of output polyphony, type of output estimated, and type of input audio
- Polyphonic AMT supports monophonic sources
- Outputs are typically frame-level multipitch estimation or note-level estimation
- Note estimation is difficult for singing voice
- Transformers have been applied to AMT to predict MIDI-like note events from spectrograms
- Traditional AMT methods are more generalizable to multiple instruments than more recent approaches
Model
- Goal is to create a lightweight AMT model that generalizes across polyphonic/monophonic instruments without retraining
- Model is shallow to keep memory needs low
- Uses Constant-Q Transform (CQT) with 3 bins per semitone and a hop size of ≈ 11 ms
- Uses Harmonic CQT (HCQT) to align harmonically-related frequencies along a third dimension
- Architecture estimates Yp, Yn, and Yo
- Binary cross entropy is used as the loss function for each output
- Post-processing step creates note events by peak picking Yo and tracking forward/backward in time through Yn
- Multi-pitch estimates created by peak picking Yp across frequency
Experiments
- NMP is evaluated using metrics proposed for MIREX3 evaluation tasks
- Fmeasure, Fmeasure-no-offset, and frame-level note accuracy are used to measure performance
- Fmeasure-no-offset is the main measure of overall note estimation accuracy
- Training and test data spans multiple instrument types
- 5% of tracks from the training set are used for validation
Note transcription baseline comparison
- NMP outperforms MI-AMT on all test datasets and metrics, except for comparable Acc on MAESTRO and Slakh.
- NMP performs strongly for both polyphonic and monophonic instruments.
- NMP performs well across datasets with varying instrument types.
Ablation experiments
- Harmonic Stacking improves performance when used as an input representation
- Omitting Harmonic Stacking reduces performance
- Supervised bottleneck layer Yp improves accuracy across all datasets
- Model outperforms instrument-agnostic baseline on various datasets
- Comparison with open-source instrument-specific models shows upper limits of model
Comparison with instrument-specific approaches
- NMP outperforms TENT for all metrics on guitar
- NMP and Vocano have comparable frame-level pitch accuracy on vocals
- OF outperforms NMP on MAESTRO dataset
- Main difference in performance is due to onset detection accuracy
- NMP would perform competitively with Melodyne5 on piano data
Mpe baseline
- NMP performs better than deep salience on Bach 10 dataset
- Deep salience performs better than NMP on Su dataset
- NMP’s 3-bin-per-semitone resolution posteriorgrams can be used to estimate continuous multi pitch estimates
Efficiency
- NMP and MI-AMT were compared in terms of peak memory usage and total run time on a 2017 Macbook Pro
- NMP and MI-AMT had similar overhead, but NMP outperformed MI-AMT on the long file
- Instrument-specific models had higher peak memory usage than NMP and MI-AMT
Conclusions
- Proposed low-resource neural network-based model (NMP) can be applied to instrument-agnostic polyphonic note transcription and MPE
- NMP outperforms a recent strong baseline note estimation model across five different datasets
- NMP performs similarly to deep salience for MPE
- Harmonic stacking allows model to remain low-resource while maintaining performance
- NMP achieves state-of-the-art results on GuitarSet, but not on piano and vocals