Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- BabyLM Challenge is a shared task for computer science research related to language modeling, human language acquisition, low-resource NLP, and cognitive modeling.
- Three tracks are available, two of which restrict the training data to pre-released datasets of 10M and 100M words.
- The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality.
- A shared evaluation pipeline will be released to score models on a variety of benchmarks and tasks.
Paper Content
Motivation
- Huge efforts have been put into optimizing LM pretraining
- Datasets have grown by orders of magnitude
- Little progress in pretraining at smaller human-like data scales
- Small-scale pretraining can be a sandbox for developing novel techniques
- Improving ability to train LMs on same data humans learn from
- Shared task to incentivize researchers to focus on optimizing pretraining
- STRICT and STRICT-SMALL tracks with different dataset sizes
- LOOSE track allows unlimited non-linguistic data or text
Dataset
- Developmentally plausible pretraining dataset inspired by input to children
- Must use only this training data for STRICT(-SMALL) tracks, different data for LOOSE track
- Under 100M words
- Mostly transcribed speech
Evaluation
- Evaluation pipeline based on Google Colab
- Evaluation code is public
- Models must be able to score sequences and be fine-tuned for classification tasks
- Submissions must include model outputs for core evaluations
Baselines
- Release baseline models with evaluation pipeline
- Hyperparameters from established large language models
- Download model, predictions, and additional data
- Estimate resources required to pretrain on 10M and 100M words
Faqs
- Papers can be submitted to multiple tracks
- Participants encouraged to submit reports
- Additional evaluation metrics can be submitted
- Any kind of training objective/regime is permitted
- No limits on hyperparameters or number of epochs
Organizing committee
- Modern Language Models are trained on data much larger than the amount available to a typical human child.
- The BabyLM Challenge includes two tracks with different amounts of data.