Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Automatic music generation with AI requires a lot of data which is hard to obtain for less common genres and instruments.
- Deep models can transfer knowledge from language to music by finetuning large language models pre-trained on a massive text corpus.
- GPT3 is capable of generating reasonable drum grooves, while models not pre-trained show no such ability.
- Evaluating generated music is challenging, especially drum grooves with little precedence in literature.
- A tailored structural evaluation method is proposed and analyzed.
- Language-to-music transfer learning with large language models is viable and promising.
Paper Content
Introduction
- Music understanding and generation using AI has been studied for a long time
- Music and language have similarities in their surface form
- NLP techniques have been applied to music
- Transformers have been used for symbolic music processing
- Zeng et al. (2021) pre-trained a Transformer on a large symbolic music corpus
- Low-resource symbolic music processing is challenging
- Text-to-music transfer learning potential in LLMs is explored
- Drum set is studied due to its prevalence and simpler symbolic representation
- GPT3 model is finetuned on Groove dataset
- GPT3 models can generate nontrivial drum grooves
- Evaluation methodology proposed to evaluate machine-generated drum grooves
The drum set
- Drum set is a compound musical instrument made up of drums and cymbals
- Drum set typically includes hi-hat, crash cymbal, ride cymbal, bass drum, snare drum, and tom
- Performance on drum set can be notated as sheet music or MIDI
Dataset
- Google’s Groove MIDI Dataset is the largest and most high-quality dataset of drum performances
- It contains 1,150 MIDI files and over 22,000 measures of drumming
- The dataset is split into grooves and fills
- We only consider grooves in the dataset
- We simplify the dataset by quantizing notes to a 16-th note grid, discarding velocity information, reducing articulations to basic head hit, truncating each MIDI file to 16 measures, and removing empty leading measures
Task
- Drum completion: given 2 measures, model must complete the rest of the 14 measures of the groove
- Model may terminate at any point
- Analogy to conditioned story generation in NLP
Model
- Two naive baselines are considered
- OpenAI’s GPT3 Davinci is finetuned on the training set
- A smaller GPT3 Ada model is also considered
- Language pre-training is found to be a necessary condition for effective drum generation
Evaluation
- GPT3 can generate drum grooves
- Evaluation of GPT3 drum grooves is done using both objective and subjective methods
- Results are based on a test set
Objective evaluation
- Automatic evaluation methods of symbolic music generation are not suitable for the task
- Perplexity has low correlation with human perception
- Structural similarity is often approximated with simple measures and discourages creativity
- Pattern and fill analysis proposed to evaluate drum grooves
- Criteria for good drum grooves: consistent patterns, similar measures in patterns, different measures in fills
- Human-performed grooves mostly satisfy criteria
- GPT3 models can generate nontrivial drum grooves, but less variation than human
- K-means used to separate measures into two clusters, human-performed grooves have clearer pattern-versus-fill separation than GPT3
Subjective evaluation
- Objective evaluation is insufficient
- Listening test and error analysis conducted
- Criteria: repetitive, consistent, chaotic, reasonable drum fill
- Table 2 shows preliminary findings
- Randomly generated grooves judged as chaotic
- Human grooves judged as consistent and include at least one fill
- GPT3 Davinci grooves slightly less consistent and fewer fills
- GPT3 Ada grooves more inconsistent and fewer fills
Case study and takeaways
- LLMs can write good drum grooves with only hundreds of training examples
- LLMs and humans can compose drum grooves with structure and variations
- Common source of lack of musicality in machine-generated drum grooves is misplacement of back-beats
- Human drummers tend to respect back-beat placements, while models disregard them
- Lack of drum fills is another source of lack of musicality
- Repetition of drum grooves is accepted in some genres, but not in others
- Future work should take a modular approach to reinforce strengths and alleviate weaknesses
- Variation within drum grooves can be controlled by injecting additional labels in training data
Improvisation or recitation?
- LLMs have the ability to be creative with drum composition
- Question posed: are LLMs creating new drum grooves or regurgitating what they have seen?
- Frequency of generated measures in test set compared to training set
- Results show only a small portion of generated measures are duplicates, suggesting LLMs can compose novel and unseen drum grooves
Related work
- Automatic music generation has been studied for a long time
- Recent work has focused on using AI and deep neural networks
- AI can aid or replace human composers and arrangers
- Most modern work uses model architectures from computer vision or NLP
- Automatic drum generation has a small body of existing work
- Some work focuses on rhythm games
- LLMs have been dominant in NLP tasks
- Transfer learning between language and non-language tasks is worth studying
Conclusion and future work
- LLMs can learn to generate music with only a few hundred symbolic music files
- LLMs’ ability to generate music is attributed to model size and language pre-training
- Automatic evaluation of drum grooves is possible
- Velocity can be encoded in drumroll representation
- Methodology can be ported to other instruments such as piano
- Multi-instrument conditioned generation is more challenging