Abstract

  • Automatic Music Transcription (AMT) is a key technology with many applications.
  • Instrument-specific systems tend to be more accurate than instrument-agnostic methods.
  • Estimating frame-wise $f_0$ values is easier than note event detection.
  • This paper proposes a lightweight neural network for musical instrument transcription.
  • The model is trained to predict onsets, multipitch and note activations.
  • The system is substantially better than a comparable baseline.

Paper Content

Introduction

  • Automatic transcription of music has been studied for more than four decades
  • Systems have improved since the rise of deep learning, but the task remains unsolved
  • AMT systems are usually designed with a limited scope, focusing on a sub-task
  • Sub-tasks branch along three dimensions: output polyphony, output type, and input audio
  • Specializing for a specific instrument class can increase accuracy
  • Deploying a number of specialized systems can be intractable
  • This paper considers an instrument-agnostic polyphonic AMT model
  • Model is lightweight and runs efficiently on low-end devices
  • Model is jointly trained to predict frame-level onset, multipitch and note posteriorgrams
  • Model is evaluated against a recent baseline model
  • Code and trained models are made publicly available
  • AMT systems have three dimensions: degree of output polyphony, type of output estimated, and type of input audio
  • Polyphonic AMT supports monophonic sources
  • Outputs are typically frame-level multipitch estimation or note-level estimation
  • Note estimation is difficult for singing voice
  • Transformers have been applied to AMT to predict MIDI-like note events from spectrograms
  • Traditional AMT methods are more generalizable to multiple instruments than more recent approaches

Model

  • Goal is to create a lightweight AMT model that generalizes across polyphonic/monophonic instruments without retraining
  • Model is shallow to keep memory needs low
  • Uses Constant-Q Transform (CQT) with 3 bins per semitone and a hop size of ≈ 11 ms
  • Uses Harmonic CQT (HCQT) to align harmonically-related frequencies along a third dimension
  • Architecture estimates three posteriorgrams: multipitch (Yp), note (Yn), and onset (Yo)
  • Binary cross entropy is used as the loss function for each output
  • Post-processing step creates note events by peak picking Yo and tracking forward/backward in time through Yn
  • Multi-pitch estimates created by peak picking Yp across frequency

Experiments

  • The proposed model (NMP) is evaluated using metrics proposed for the MIREX evaluation tasks
  • F-measure, F-measure-no-offset, and frame-level note accuracy are used to measure performance
  • F-measure-no-offset is the main measure of overall note estimation accuracy
  • Training and test data spans multiple instrument types
  • 5% of tracks from the training set are used for validation
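
To make the metrics above concrete, here is a simplified sketch. It assumes the tolerances that are the defaults in the mir_eval library (50 ms for onsets, 50 cents for pitch), and it uses greedy matching where mir_eval uses an optimal bipartite matching, so scores can differ on dense passages. The function names are hypothetical, and the frame-level accuracy uses the common TP / (TP + FP + FN) definition, which is an assumption about the exact formula used in the paper.

```python
import math


def note_f_no_offset(ref, est, onset_tol=0.05, pitch_tol_cents=50.0):
    """Precision, recall, and F-measure ignoring note offsets.

    ref, est: lists of (onset_seconds, pitch_hz).
    Each estimated note can match at most one reference note.
    """
    used = set()
    tp = 0
    for r_on, r_hz in ref:
        for j, (e_on, e_hz) in enumerate(est):
            if j in used:
                continue
            cents = abs(1200.0 * math.log2(e_hz / r_hz))
            if abs(e_on - r_on) <= onset_tol and cents <= pitch_tol_cents:
                used.add(j)
                tp += 1
                break
    precision = tp / len(est) if est else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f


def frame_accuracy(ref_frames, est_frames):
    """Frame-level accuracy: TP / (TP + FP + FN) over all frames.

    ref_frames, est_frames: lists of sets of active pitches per frame.
    """
    tp = fp = fn = 0
    for r, e in zip(ref_frames, est_frames):
        tp += len(r & e)
        fp += len(e - r)
        fn += len(r - e)
    total = tp + fp + fn
    return tp / total if total else 0.0
```

The full F-measure adds a third matching condition on note offsets; it is left out here to keep the sketch short.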

Note transcription baseline comparison

  • NMP outperforms the baseline model (MI-AMT) on all test datasets and metrics, except on MAESTRO and Slakh, where frame-level accuracy (Acc) is comparable.
  • NMP performs strongly for both polyphonic and monophonic instruments.
  • NMP performs well across datasets with varying instrument types.

Ablation experiments

  • Using harmonic stacking as the input representation improves performance; omitting it reduces performance
  • Supervised bottleneck layer Yp improves accuracy across all datasets
  • Model outperforms instrument-agnostic baseline on various datasets
  • Comparison with open-source instrument-specific models shows upper limits of model
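
Harmonic stacking, the input representation examined in this ablation, can be sketched as follows. This is a minimal sketch assuming that stacking is done by shifting a log-frequency CQT by a whole number of bins per harmonic, so that harmonically related frequencies line up across channels. The function name, the harmonic list, and the zero padding at the edges are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def harmonic_stack(cqt, harmonics=(0.5, 1, 2, 3), bins_per_semitone=3):
    """Stack shifted copies of a CQT along a new leading axis.

    cqt: array of shape (n_frames, n_bins) on a log-frequency axis.
    Returns an array of shape (len(harmonics), n_frames, n_bins) where
    channel i, bin k holds the energy found at h_i times the frequency
    of bin k. Bins shifted in from outside the CQT range are zero.
    """
    bins_per_octave = 12 * bins_per_semitone
    n_frames, n_bins = cqt.shape
    stacked = np.zeros((len(harmonics), n_frames, n_bins), dtype=cqt.dtype)
    for i, h in enumerate(harmonics):
        # A frequency ratio h corresponds to a fixed shift on a log axis.
        shift = int(round(bins_per_octave * np.log2(h)))
        if abs(shift) >= n_bins:
            continue  # shift exceeds the available range; leave zeros
        if shift >= 0:
            stacked[i, :, : n_bins - shift] = cqt[:, shift:]
        else:
            stacked[i, :, -shift:] = cqt[:, : n_bins + shift]
    return stacked
```

Because each channel is only a shifted copy, a small convolution kernel spanning the channel axis can see a fundamental and its harmonics at the same position, which is one plausible reason a shallow model can work with this input.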

Comparison with instrument-specific approaches

  • NMP outperforms TENT for all metrics on guitar
  • NMP and Vocano have comparable frame-level pitch accuracy on vocals
  • Onsets and Frames (OF) outperforms NMP on the MAESTRO dataset
  • Main difference in performance is due to onset detection accuracy
  • NMP would perform competitively with Melodyne 5 on piano data

MPE baseline

  • NMP performs better than deep salience on the Bach10 dataset
  • Deep salience performs better than NMP on the Su dataset
  • NMP’s 3-bin-per-semitone posteriorgrams can be used to derive continuous multipitch estimates
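
As an illustration of the last point, a posteriorgram with 3 bins per semitone has 36 bins per octave, so bin k maps to the frequency f_min · 2^(k/36). The sketch below assumes a lowest bin of 27.5 Hz (a hypothetical value) and uses parabolic interpolation around a peak to obtain a fractional bin. The interpolation is a common signal-processing technique swapped in for illustration, not a step confirmed by the paper.

```python
def bin_to_hz(bin_index, f_min=27.5, bins_per_semitone=3):
    """Frequency in Hz of a (possibly fractional) log-frequency bin index."""
    return f_min * 2.0 ** (bin_index / (12 * bins_per_semitone))


def refine_peak(frame, k):
    """Fractional bin position of a peak at bin k via parabolic interpolation.

    frame: sequence of activations across frequency for one time frame.
    Falls back to k at the edges or when the three points are collinear.
    """
    if k <= 0 or k >= len(frame) - 1:
        return float(k)
    a, b, c = frame[k - 1], frame[k], frame[k + 1]
    denom = a - 2 * b + c
    if denom == 0:
        return float(k)
    return k + 0.5 * (a - c) / denom
```

A continuous pitch estimate for a peak at bin k is then `bin_to_hz(refine_peak(frame, k))`.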

Efficiency

  • NMP and MI-AMT were compared in terms of peak memory usage and total run time on a 2017 MacBook Pro
  • NMP and MI-AMT had similar overhead, but NMP outperformed MI-AMT on the long file
  • Instrument-specific models had higher peak memory usage than NMP and MI-AMT
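
A comparison of this kind could be reproduced roughly as follows. The summary does not say which profiling tools the authors used, so this sketch substitutes Python's standard `tracemalloc` and `time.perf_counter`; the `profile` helper is hypothetical. Note that `tracemalloc` only reports memory allocated through Python's allocators, so native buffers inside a deep learning runtime may not be counted.

```python
import time
import tracemalloc


def profile(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak
```

Running the same helper on a short and a long audio file separates fixed overhead (model loading) from cost that scales with input length.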

Conclusions

  • Proposed low-resource neural network-based model (NMP) can be applied to instrument-agnostic polyphonic note transcription and MPE
  • NMP outperforms a recent strong baseline note estimation model across five different datasets
  • NMP performs similarly to deep salience for MPE
  • Harmonic stacking allows model to remain low-resource while maintaining performance
  • NMP achieves state-of-the-art results on GuitarSet, but not on piano and vocals