Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Need for large-scale high-quality text datasets
  • BigScience workshop formed to research and train large language models
  • ROOTS corpus created, 1.6TB dataset spanning 59 languages
  • BLOOM language model trained using ROOTS corpus
  • Large initial subset of corpus released with processing tools

Paper Content

Introduction

  • BigScience is a one-year open collaborative research initiative
  • Goal was to train an open-access, massively multilingual language model
  • Engaged in ethical, sociopolitical, and data governance issues
  • Four working groups: Data Governance, Data Sourcing and Preparation, Privacy, Legal Scholarship
  • Released a large subset of ROOTS
  • Released data tools used to curate, source, clean and inspect constituent datasets

Outline of the paper

  • Collected a web-scale dataset covering 59 languages
  • 46 natural languages and 13 programming languages
  • 62% of text from community-selected and documented list of language data sources
  • 38% of text from pre-processed web crawl, OSCAR
  • Filtered with help of native speakers
  • Pre-trained models are used in natural language processing
  • Performance is based on model size and dataset size/quality
  • Recent models trained on up to 1.4 trillion tokens
  • Datasets are not usually released
  • Exceptions include Pile, C4, mC4, CC100, and OSCAR
  • Tooling, visualization, and replication of datasets is needed
  • Documentation of corpora is becoming more common

(Crowd)sourcing a language resource catalogue

  • 62% of the dataset was made up of monolingual and multilingual language resources
  • Metadata was collected through open submissions and hackathons
  • 252 sources were collected, including at least 21 per language category
  • Additional Arabic language resources were gathered from the Masader repository
  • Websites were selected to increase geographical diversity of English, Spanish, and Chinese language data
  • Code data was selected from GitHub and StackExchange to test large language models’ ability to handle computer code

Obtaining data from the identified resources

  • Leveraged BigScience Catalogue and Masader repository to obtain text from identified sources
  • Established 2-phase approach to collect data sources and map them to a common format
  • Organized open hackathon to gather identified sources on Hugging Face Datasets hub
  • Language segmentation to obtain monolingual datasets
  • Documents consist of two fields: “text” and “meta”
  • Websites required a particular effort and dedicated pipeline
  • Pseudo-crawling used to retrieve pages from Common Crawl
  • Domain names from two sources: metadata and participants
  • Collected URLs from Common Crawl index
  • Extracted text from HTML pages
  • Collected code dataset from BigQuery
  • Manually inspected, deduplicated, and made further selection of sources
  • Removed datasets with high incidence of non-natural language
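
The common two-field document layout mentioned above can be sketched as follows. This is a minimal illustration, not the project's loader: the helper name `to_common_format`, the choice to store "meta" as a JSON string, and the sample metadata keys are all assumptions.

```python
import json
from typing import Any


def to_common_format(raw_text: str, **metadata: Any) -> dict:
    """Map one source record onto the two-field layout: "text" and "meta"."""
    # Assumption: "meta" is serialized as a JSON string so every source,
    # whatever its original schema, ends up with the same two columns.
    return {
        "text": raw_text.strip(),
        "meta": json.dumps(metadata, sort_keys=True),
    }


# Hypothetical record; the source name and language tag are made up.
record = to_common_format(
    "  Bonjour le monde.  ", source="example-source", language="fr"
)
print(record["text"])
print(json.loads(record["meta"])["language"])
```

Keeping source-specific details inside "meta" means downstream steps only ever need to read "text".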

Processing pipeline for quality improvement on crowdsourced datasets

  • Attempted to improve quality of text from HTML
  • Applied processing pipeline to remove noisy data
  • Functions categorized as document-scoped or dataset-scoped
  • Cleaning functions remove text not part of main document
  • Filtering functions remove entire document from corpus
  • Built visualization tool to understand impact of each function
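
The split between cleaning functions (which edit a document) and filtering functions (which drop it) can be sketched as a small document-scoped pipeline. The two rules below, `strip_short_lines` and `has_enough_words`, and their thresholds are hypothetical stand-ins, not the functions used for ROOTS; dataset-scoped functions are not shown.

```python
from typing import Callable, Iterable, List

Cleaner = Callable[[str], str]   # returns a modified document
Filter = Callable[[str], bool]   # returns True to keep the document


def strip_short_lines(text: str, min_chars: int = 20) -> str:
    """Cleaning function (hypothetical rule): drop lines too short to be body text."""
    kept = [line for line in text.splitlines() if len(line.strip()) >= min_chars]
    return "\n".join(kept)


def has_enough_words(text: str, min_words: int = 5) -> bool:
    """Filtering function (hypothetical rule): keep documents with enough words."""
    return len(text.split()) >= min_words


def run_pipeline(
    docs: Iterable[str], cleaners: List[Cleaner], filters: List[Filter]
) -> List[str]:
    """Apply every cleaner in order, then keep the document only if all filters pass."""
    kept_docs = []
    for text in docs:
        for clean in cleaners:
            text = clean(text)
        if all(keep(text) for keep in filters):
            kept_docs.append(text)
    return kept_docs


docs = [
    "Home | Login\nThis paragraph is the main body of the page and should survive.",
    "Menu\nAbout\nContact",
]
result = run_pipeline(docs, [strip_short_lines], [has_enough_words])
print(result)
```

In this sketch the first document loses its navigation line and is kept, while the second is reduced to nothing by the cleaner and then removed by the filter.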

Processing OSCAR

  • Chose to complement data with Common Crawl-based data
  • Quantity of data is a strong factor in model performance
  • Used OSCAR version 21.09 to make up 38% of final dataset
  • Crawled data has issues such as machine-generated content and privacy risk
  • Used quality indicators to filter out specific pages
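
Filtering pages on quality indicators might look like the sketch below. The summary does not list the indicators or their cutoffs, so the two indicators here (`special_char_ratio`, `repeated_word_ratio`) and the values in `THRESHOLDS` are purely illustrative assumptions.

```python
import re


def special_char_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def repeated_word_ratio(text: str) -> float:
    """Share of word occurrences that repeat an earlier word on the page."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)


# Illustrative cutoffs only; these numbers are not taken from the paper.
THRESHOLDS = {"special_char_ratio": 0.3, "repeated_word_ratio": 0.5}


def keep_page(text: str) -> bool:
    """Keep a page only if every indicator stays at or below its cutoff."""
    scores = {
        "special_char_ratio": special_char_ratio(text),
        "repeated_word_ratio": repeated_word_ratio(text),
    }
    return all(scores[name] <= limit for name, limit in THRESHOLDS.items())


print(keep_page("The corpus covers many languages and sources."))
print(keep_page("buy buy buy buy buy now now now"))
```

In practice, cutoffs like these would need tuning per language, which is where input from native speakers becomes useful.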

Deduplication

  • Data deduplication improves performance and decreases memorization of training data
  • Initially used SimHash to remove near duplicate documents
  • 0.7% of documents identified as near duplicates
  • SimHash produced false positives, mostly among long documents
  • Substring deduplication was applied instead to documents with more than 6000 characters
  • On average, 21.67% of the data (in bytes) was found to be duplicated
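
For readers unfamiliar with SimHash, here is a minimal sketch of the general technique. The shingle size, the 64-bit fingerprint, and the use of BLAKE2b as the underlying hash are assumptions for illustration, not the settings used for ROOTS.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, shingle_size: int = 3) -> int:
    """Compute a SimHash fingerprint over word shingles (bits must be a multiple of 8)."""
    tokens = re.findall(r"\w+", text.lower())
    n_shingles = max(1, len(tokens) - shingle_size + 1)
    shingles = [" ".join(tokens[i:i + shingle_size]) for i in range(n_shingles)]

    # Each shingle votes +1 or -1 on every bit position of the fingerprint.
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        value = int.from_bytes(digest, "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1

    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "Hello, World! This is a test."
doc_b = "hello world this is a test"
print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

Two documents are treated as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The two sample strings differ only in case and punctuation, so they tokenize identically and get the same fingerprint.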

Personally identifiable information

  • Used rule-based approach with regular expressions
  • Redacted instances of KEY, EMAIL, USER, and IP_ADDRESS

A first look at ROOTS

  • 1.6 Terabytes of multilingual text
  • Figure 4 compares the sizes of corpora used to train large language models
  • Interactive dataset card deck documents individual components of the corpus

Natural languages

  • Corpus created through crowdsourcing
  • 46 languages from 3 macroareas and 9 language families
  • English is the largest part of the corpus (30.03%)
  • Median size of a document is 1,129 bytes
  • Online interactive exploration tool available for more detailed breakdown

Programming languages

  • 13 programming languages are represented in the code subset of the corpus
  • Java, PHP, and C++ make up more than half of all documents
  • A heuristic is used to identify configuration and test files
  • 5.23% of the data consists of configuration files and 7.88% of test files
  • 10.9M duplicate files and 4.1M unique files are found in the clusters
  • 32% of the data consists of near-duplicates
  • Syntax checkers are used to validate 500K samples of Python and PHP code
  • 1% of the Python data and 2% of the PHP files do not pass the syntax check
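
A syntax check over Python samples can be approximated with the standard library's `ast.parse`. This is an assumption about how such a check might be done, not a description of the checker the authors used. The result also depends on the interpreter version, since Python 2 code will fail under a Python 3 parser.

```python
import ast
from typing import Sequence


def passes_python_syntax_check(source: str) -> bool:
    """Return True if the source parses as Python under the running interpreter."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False
    return True


def failure_rate(samples: Sequence[str]) -> float:
    """Fraction of samples that do not pass the syntax check."""
    if not samples:
        return 0.0
    failures = sum(1 for src in samples if not passes_python_syntax_check(src))
    return failures / len(samples)


# Made-up samples for illustration; the second one has a syntax error.
samples = [
    "x = 1\n",
    "def f(:\n    pass\n",
    "print('ok')\n",
    "for i in range(3):\n    pass\n",
]
print(failure_rate(samples))
```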

Tokenizer analysis of the component datasets

  • A tokenizer trained on a dataset can be used to measure the number of tokens produced per byte of natural-language text
  • Outlier values can reveal problems in a component dataset, such as incorrectly classified languages or crawling errors
  • Tokenizers trained on different corpora can be compared to see how they differ
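
The tokens-per-byte measurement can be sketched as below. The whitespace tokenizer is a stand-in assumption for a trained tokenizer, and the median-based outlier rule with `factor=2.0` is a hypothetical choice, not the analysis from the paper.

```python
import statistics
from typing import Callable, Dict, List


def tokens_per_byte(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Number of tokens produced per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        return 0.0
    return len(tokenize(text)) / n_bytes


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); swap in a trained tokenizer in practice."""
    return text.split()


def flag_outliers(ratios: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Flag components whose ratio is more than `factor` away from the median."""
    median = statistics.median(ratios.values())
    return sorted(
        name
        for name, ratio in ratios.items()
        if ratio > factor * median or ratio < median / factor
    )


# Made-up per-component ratios; "d" is far from the others.
ratios = {"a": 0.20, "b": 0.22, "c": 0.21, "d": 0.90}
print(tokens_per_byte("a b c d", whitespace_tokenize))
print(flag_outliers(ratios))
```

A component flagged this way is only a candidate for manual inspection; an unusual ratio may also reflect a legitimately different script or domain.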

Conclusion

  • ROOTS is a massive multilingual corpus created by an international collaboration of researchers
  • Data-first approach was used to train the BLOOM model
  • Tooling developed throughout the project is released
  • BigScience Research Workshop was conceived as a collaborative and value-driven endeavor
  • Core values chosen by the collaborators contributing to data efforts were articulated
  • Values include openness, reproducibility, responsibility, diversity, and inclusivity
  • Participatory approaches were used to bridge gaps between model development and deployment
  • Framework was developed to uphold rights and responsibilities of stakeholders
  • Use of Common Crawl data creates a tension between the drive to release a research artifact and the values of consent and privacy
  • Limitations include use of data from Common Crawl and reliance on medium to large sources of digitized content
  • Tools used to obtain the crowdsourced dataset include pseudocode to recreate text structure from HTML code, a visualisation tool, and the functions used in the processing pipeline