Abstract

  • Automatic music generation with AI requires large amounts of data, which are hard to obtain for less common genres and instruments.
  • Deep models can transfer knowledge from language to music by finetuning large language models pre-trained on a massive text corpus.
  • GPT3 is capable of generating reasonable drum grooves, while models without language pre-training show no such ability.
  • Evaluating generated music is challenging, especially for drum grooves, which have little precedent in the literature.
  • A tailored structural evaluation method is proposed and analyzed.
  • Language-to-music transfer learning with large language models is viable and promising.

Paper Content

Introduction

  • Music understanding and generation using AI have been studied for a long time
  • Music and language have similarities in their surface form
  • NLP techniques have been applied to music
  • Transformers have been used for symbolic music processing
  • Zeng et al. (2021) pre-trained a Transformer on a large symbolic music corpus
  • Low-resource symbolic music processing is challenging
  • Text-to-music transfer learning potential in LLMs is explored
  • Drum set is studied due to its prevalence and simpler symbolic representation
  • GPT3 model is finetuned on Groove dataset
  • GPT3 models can generate nontrivial drum grooves
  • Evaluation methodology proposed to evaluate machine-generated drum grooves

The drum set

  • Drum set is a compound musical instrument made up of drums and cymbals
  • Drum set typically includes hi-hat, crash cymbal, ride cymbal, bass drum, snare drum, and tom
  • Performance on drum set can be notated as sheet music or MIDI

Dataset

  • Google’s Groove MIDI Dataset is the largest and highest-quality dataset of drum performances
  • It contains 1,150 MIDI files and over 22,000 measures of drumming
  • The dataset is split into grooves and fills
  • We only consider grooves in the dataset
  • We simplify the dataset by quantizing notes to a 16th-note grid, discarding velocity information, reducing every articulation to a basic head hit, truncating each MIDI file to 16 measures, and removing empty leading measures
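
The simplification steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' preprocessing code: the `(onset_in_beats, drum_name, velocity)` note layout, the 4/4 time signature, and the choice to drop empty leading measures before truncating are all assumptions made for the example.

```python
import math
from typing import Dict, List, Set, Tuple

STEPS_PER_MEASURE = 16  # 16th-note grid, assuming 4/4 time
MAX_MEASURES = 16       # each groove is truncated to 16 measures

# Hypothetical note layout: (onset in beats, drum name, MIDI velocity).
Note = Tuple[float, str, int]
Measure = List[Set[str]]  # 16 steps, each holding the drums hit on that step


def simplify(notes: List[Note]) -> List[Measure]:
    """Quantize onsets to a 16th-note grid, drop velocity, trim, truncate."""
    grid: Dict[int, Set[str]] = {}
    for onset, drum, _velocity in notes:      # velocity is discarded
        step = math.floor(onset * 4 + 0.5)    # nearest 16th note (4 per beat)
        grid.setdefault(step, set()).add(drum)
    if not grid:
        return []
    n_measures = max(grid) // STEPS_PER_MEASURE + 1
    measures = [
        [grid.get(m * STEPS_PER_MEASURE + s, set())
         for s in range(STEPS_PER_MEASURE)]
        for m in range(n_measures)
    ]
    # Remove empty leading measures (assumed to happen before truncation).
    while measures and not any(measures[0]):
        measures.pop(0)
    return measures[:MAX_MEASURES]
```

Reducing articulations to a basic head hit is not shown; in this sketch it would amount to mapping several MIDI pitches to one `drum_name` before calling `simplify`.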

Task

  • Drum completion: given the first 2 measures, the model must complete the remaining 14 measures of the groove
  • Model may terminate at any point
  • Analogy to conditioned story generation in NLP
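
As a rough sketch of how a groove might be split for this task, the snippet below separates the first 2 measures (the prompt) from the measures the model is expected to produce. Representing each measure as a string, and the function name `split_for_completion`, are assumptions made for illustration only.

```python
from typing import List, Tuple

PROMPT_MEASURES = 2   # measures given to the model
MAX_MEASURES = 16     # grooves are capped at 16 measures


def split_for_completion(measures: List[str]) -> Tuple[List[str], List[str]]:
    """Return (prompt, target): the first 2 measures and up to 14 that follow."""
    clipped = measures[:MAX_MEASURES]
    return clipped[:PROMPT_MEASURES], clipped[PROMPT_MEASURES:]
```

Because the model may terminate at any point, a target shorter than 14 measures is valid; the split only fixes the upper bound.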

Model

  • Two naive baselines are considered
  • OpenAI’s GPT3 Davinci is finetuned on the training set
  • A smaller GPT3 Ada model is also considered
  • Language pre-training is found to be a necessary condition for effective drum generation

Evaluation

  • GPT3 can generate drum grooves
  • Evaluation of GPT3 drum grooves is done using both objective and subjective methods
  • Results are based on a test set

Objective evaluation

  • Automatic evaluation methods of symbolic music generation are not suitable for the task
  • Perplexity has low correlation with human perception
  • Structural similarity is often approximated with simple measures and discourages creativity
  • Pattern and fill analysis proposed to evaluate drum grooves
  • Criteria for good drum grooves: consistent patterns, similar measures in patterns, different measures in fills
  • Human-performed grooves mostly satisfy criteria
  • GPT3 models can generate nontrivial drum grooves, but with less variation than human drummers
  • K-means is used to separate the measures of a groove into two clusters; human-performed grooves show a clearer pattern-versus-fill separation than GPT3 grooves
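
To make the clustering step concrete, here is a minimal two-cluster k-means (Lloyd's algorithm) over measures encoded as numeric vectors. The vector encoding, the deterministic initialization, and reading the smaller cluster as "fills" are assumptions for this sketch; the paper's exact features and settings are not reproduced here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two measure vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(measures: List[Vector], iterations: int = 20) -> List[int]:
    """Label each measure 0 or 1 using k-means with k=2 (needs >= 1 measure)."""
    # Deterministic init (an assumption): the first measure and the
    # measure farthest from it serve as the two starting centroids.
    c0 = list(measures[0])
    c1 = list(max(measures, key=lambda m: sq_dist(m, c0)))
    labels = [0] * len(measures)
    for _ in range(iterations):
        labels = [0 if sq_dist(m, c0) <= sq_dist(m, c1) else 1 for m in measures]
        group0 = [m for m, lab in zip(measures, labels) if lab == 0]
        group1 = [m for m, lab in zip(measures, labels) if lab == 1]
        new_c0 = mean(group0) if group0 else c0
        new_c1 = mean(group1) if group1 else c1
        if new_c0 == c0 and new_c1 == c1:
            break  # centroids stopped moving
        c0, c1 = new_c0, new_c1
    return labels
```

For a groove that repeats one pattern with an occasional fill, the labels would be expected to split into a large "pattern" cluster and a small "fill" cluster; a groove with no clear fills gives a less clean separation.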

Subjective evaluation

  • Objective evaluation is insufficient
  • Listening test and error analysis conducted
  • Criteria: repetitive, consistent, chaotic, reasonable drum fill
  • Table 2 shows preliminary findings
  • Randomly generated grooves judged as chaotic
  • Human grooves judged as consistent and include at least one fill
  • GPT3 Davinci grooves are slightly less consistent and contain fewer fills
  • GPT3 Ada grooves are more inconsistent and contain fewer fills

Case study and takeaways

  • LLMs can write good drum grooves with only hundreds of training examples
  • LLMs and humans can compose drum grooves with structure and variations
  • Common source of lack of musicality in machine-generated drum grooves is misplacement of back-beats
  • Human drummers tend to respect back-beat placements, while models disregard them
  • Lack of drum fills is another source of lack of musicality
  • Repetition of drum grooves is accepted in some genres, but not in others
  • Future work should take a modular approach to reinforce strengths and alleviate weaknesses
  • Variation within drum grooves can be controlled by injecting additional labels in training data

Improvisation or recitation?

  • LLMs have the ability to be creative with drum composition
  • Question posed: are LLMs creating new drum grooves or regurgitating what they have seen?
  • Measures generated for the test set are checked against the training set to see how often they also appear there
  • Results show only a small portion of generated measures are duplicates, suggesting LLMs can compose novel and unseen drum grooves
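
The duplicate check described above can be sketched as a simple set lookup. Encoding each measure as a string and the function name `seen_fraction` are assumptions for illustration; the paper's own counting procedure may differ in detail.

```python
from typing import Iterable, List


def seen_fraction(generated: List[str], training: Iterable[str]) -> float:
    """Fraction of generated measures that also occur in the training set."""
    train_set = set(training)
    if not generated:
        return 0.0
    return sum(m in train_set for m in generated) / len(generated)
```

A low value suggests the model is composing measures it has not seen, while a value near 1.0 would point to recitation of training data.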

Related work

  • Automatic music generation has been studied for a long time
  • Recent work has focused on using AI and deep neural networks
  • AI can aid or replace human composers and arrangers
  • Most modern work uses model architectures from computer vision or NLP
  • Automatic drum generation has a small body of existing work
  • Some work focuses on rhythm games
  • LLMs have been dominant in NLP tasks
  • Transfer learning between language and non-language tasks is worth studying

Conclusion and future work

  • LLMs can learn to generate music with only a few hundred symbolic music files
  • LLMs’ ability to generate music is attributed to model size and language pre-training
  • Automatic evaluation of drum grooves is possible
  • Velocity can be encoded in drumroll representation
  • Methodology can be ported to other instruments such as piano
  • Multi-instrument conditioned generation is more challenging