Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • The BabyLM Challenge is a shared task for computer science research related to language modeling, human language acquisition, low-resource NLP, and cognitive modeling.
  • Three tracks are available, two of which restrict the training data to pre-released datasets of 10M and 100M words.
  • The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality.
  • A shared evaluation pipeline will be released to score models on a variety of benchmarks and tasks.

Paper Content

Motivation

  • Huge efforts have been put into optimizing LM pretraining
  • Datasets have grown by orders of magnitude
  • Little progress in pretraining at smaller human-like data scales
  • Small-scale pretraining can be a sandbox for developing novel techniques
  • Improving ability to train LMs on same data humans learn from
  • Shared task to incentivize researchers to focus on optimizing pretraining
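  • STRICT and STRICT-SMALL tracks train on the pre-released datasets of 100M and 10M words, respectively
  • LOOSE track restricts only the amount of text (under 100M words) and allows unlimited non-linguistic data or text from other sources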

Dataset

  • Developmentally plausible pretraining dataset inspired by input to children
  • STRICT and STRICT-SMALL submissions must train only on this data; LOOSE submissions may use different data
  • Under 100M words
  • Mostly transcribed speech
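
For a LOOSE track submission that brings its own text, the word budget could be checked with something like the minimal sketch below. The whitespace-based counting rule and the names `TRACK_BUDGETS`, `count_words`, and `within_budget` are assumptions made for illustration; the challenge may count words differently.

```python
# Word budgets follow the track descriptions above.
# Assumption: a "word" is a whitespace-separated token.
TRACK_BUDGETS = {
    "strict-small": 10_000_000,
    "strict": 100_000_000,
    "loose": 100_000_000,  # applies to text only; non-linguistic data is not counted
}


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines, track):
    """Return (ok, n_words) for a corpus checked against the track's word budget."""
    n_words = count_words(lines)
    return n_words <= TRACK_BUDGETS[track], n_words


if __name__ == "__main__":
    # Tiny made-up corpus in the style of transcribed child-directed speech.
    corpus = ["do you want the ball ?", "yes , the red ball ."]
    ok, n = within_budget(corpus, "loose")
    print(f"{n} words, within budget: {ok}")
```

Because `count_words` accepts any iterable of lines, an open file handle can be passed directly, so a large corpus does not need to be loaded into memory.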

Evaluation

  • Evaluation pipeline based on Google Colab
  • Evaluation code is public
  • Models must be able to score sequences and be fine-tuned for classification tasks
  • Submissions must include model outputs for core evaluations
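
To illustrate what "scoring a sequence" can mean, here is a minimal sketch in which a toy bigram model assigns a log-probability to a sentence and two sentences are compared by score. The `BigramLM` class and the `prefers` helper are hypothetical stand-ins, not part of the official evaluation pipeline; a real submission would expose an equivalent scoring function for its own model.

```python
import math
from collections import Counter


class BigramLM:
    """Toy add-one-smoothed bigram model, standing in for a real pretrained LM."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each token appears as a context
        self.bigram_counts = Counter()   # how often each (previous, current) pair appears
        self.vocab = {"<s>", "</s>"}
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def score(self, sentence):
        """Return the log-probability of a sentence (higher means more likely)."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            # Add-one smoothing keeps unseen bigrams from getting zero probability.
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def prefers(model, sentence_a, sentence_b):
    """True if the model scores sentence_a strictly higher than sentence_b."""
    return model.score(sentence_a) > model.score(sentence_b)


if __name__ == "__main__":
    lm = BigramLM(["the dog barks", "the cat sleeps", "the dogs bark"])
    print(prefers(lm, "the dog barks", "the dog bark"))
```

Any model that can return a comparable per-sequence score, whether autoregressive or masked, could be dropped in behind the same `score` interface.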

Baselines

  • Release baseline models with evaluation pipeline
  • Hyperparameters from established large language models
  • Download model, predictions, and additional data
  • Estimate resources required to pretrain on 10M and 100M words
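
As a rough illustration of such an estimate, the sketch below uses the common rule of thumb that training costs about 6 FLOPs per parameter per token. This heuristic is not taken from the paper, and the parameter count, tokens-per-word ratio, epoch count, and hardware throughput are all hypothetical placeholders.

```python
def estimate_training_flops(n_params, n_words, epochs=1, tokens_per_word=1.3):
    """Approximate training FLOPs as 6 * parameters * tokens seen.

    tokens_per_word is an assumed tokenizer ratio, not a measured value.
    """
    tokens_seen = n_words * tokens_per_word * epochs
    return 6 * n_params * tokens_seen


def estimate_gpu_hours(flops, sustained_flops_per_second):
    """Convert a FLOP count into hours at an assumed sustained throughput."""
    return flops / sustained_flops_per_second / 3600


if __name__ == "__main__":
    n_params = 125_000_000  # hypothetical model size
    throughput = 50e12      # hypothetical sustained FLOP/s on one accelerator
    for track, n_words in [("10M words", 10_000_000), ("100M words", 100_000_000)]:
        flops = estimate_training_flops(n_params, n_words, epochs=10)
        hours = estimate_gpu_hours(flops, throughput)
        print(f"{track}: {flops:.2e} FLOPs, about {hours:.2f} accelerator-hours")
```

Under this approximation the cost scales linearly with the word budget, so the 100M-word setting costs ten times the 10M-word setting for the same model and number of epochs.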

FAQs

  • Papers can be submitted to multiple tracks
  • Participants encouraged to submit reports
  • Additional evaluation metrics can be submitted
  • Any kind of training objective/regime is permitted
  • No limits on hyperparameters or number of epochs

Organizing committee

  • Modern language models are trained on far more data than is available to a typical human child.
  • The BabyLM Challenge includes two strict tracks with different amounts of data (10M and 100M words), alongside the LOOSE track.