Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Automatic Music Transcription (AMT) is a key technology with many applications.
Instrument-specific systems tend to be more accurate than instrument-agnostic methods.
Estimating frame-wise $f_0$ values is easier than note event detection.
This paper proposes a lightweight neural network for musical instrument transcription.
The model is trained to predict onsets, multipitch and note activations.
The system is substantially better than a comparable baseline.

Paper Content

Introduction

Automatic transcription of music has been studied for more than four decades
Systems have improved since the rise of deep learning, but the task remains unsolved
AMT systems are usually designed with a limited scope, focusing on a sub-task
Sub-tasks branch along three dimensions: output polyphony, output type, and input audio
Specializing for a specific instrument class can increase accuracy
Deploying a number of specialized systems can be intractable
This paper considers an instrument-agnostic polyphonic AMT model
Model is lightweight and runs efficiently on low-end devices
Model is jointly trained to predict frame-level onset, multipitch and note posteriorgrams
Model is evaluated against a recent baseline model
Code and trained models are made publicly available

AMT systems have three dimensions: degree of output polyphony, type of output estimated, and type of input audio
Polyphonic AMT supports monophonic sources
Outputs are typically frame-level multipitch estimation or note-level estimation
Note estimation is difficult for singing voice
Transformers have been applied to AMT to predict MIDI-like note events from spectrograms
Traditional AMT methods are more generalizable to multiple instruments than more recent approaches

Model

Goal is to create a lightweight AMT model that generalizes across polyphonic/monophonic instruments without retraining
Model is shallow to keep memory needs low
Uses Constant-Q Transform (CQT) with 3 bins per semitone and a hop size of ≈ 11 ms
Uses Harmonic CQT (HCQT) to align harmonically-related frequencies along a third dimension
Architecture estimates Yp, Yn, and Yo
Binary cross entropy is used as the loss function for each output
Post-processing step creates note events by peak picking Yo and tracking forward/backward in time through Yn
Multi-pitch estimates created by peak picking Yp across frequency

Experiments

NMP is evaluated using metrics proposed for MIREX3 evaluation tasks
Fmeasure, Fmeasure-no-offset, and frame-level note accuracy are used to measure performance
Fmeasure-no-offset is the main measure of overall note estimation accuracy
Training and test data spans multiple instrument types
5% of tracks from the training set are used for validation

Note transcription baseline comparison

NMP outperforms MI-AMT on all test datasets and metrics, except for comparable Acc on MAESTRO and Slakh.
NMP performs strongly for both polyphonic and monophonic instruments.
NMP performs well across datasets with varying instrument types.

Ablation experiments

Harmonic Stacking improves performance when used as an input representation
Omitting Harmonic Stacking reduces performance
Supervised bottleneck layer Yp improves accuracy across all datasets
Model outperforms instrument-agnostic baseline on various datasets
Comparison with open-source instrument-specific models shows upper limits of model

Comparison with instrument-specific approaches

NMP outperforms TENT for all metrics on guitar
NMP and Vocano have comparable frame-level pitch accuracy on vocals
OF outperforms NMP on MAESTRO dataset
Main difference in performance is due to onset detection accuracy
NMP would perform competitively with Melodyne5 on piano data

Mpe baseline

NMP performs better than deep salience on Bach 10 dataset
Deep salience performs better than NMP on Su dataset
NMP’s 3-bin-per-semitone resolution posteriorgrams can be used to estimate continuous multi pitch estimates

Efficiency

NMP and MI-AMT were compared in terms of peak memory usage and total run time on a 2017 Macbook Pro
NMP and MI-AMT had similar overhead, but NMP outperformed MI-AMT on the long file
Instrument-specific models had higher peak memory usage than NMP and MI-AMT

Conclusions

Proposed low-resource neural network-based model (NMP) can be applied to instrument-agnostic polyphonic note transcription and MPE
NMP outperforms a recent strong baseline note estimation model across five different datasets
NMP performs similarly to deep salience for MPE
Harmonic stacking allows model to remain low-resource while maintaining performance
NMP achieves state-of-the-art results on GuitarSet, but not on piano and vocals

Link to paper#

Abstract#

Paper Content#

Introduction#

Background and related work#

Model#

Experiments#

Note transcription baseline comparison#

Ablation experiments#

Comparison with instrument-specific approaches#

Mpe baseline#

Efficiency#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Background and related work

Model

Experiments

Note transcription baseline comparison

Ablation experiments

Comparison with instrument-specific approaches

Mpe baseline

Efficiency

Conclusions