Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Need for large-scale high-quality text datasets
BigScience workshop formed to research and train large language models
ROOTS corpus created, 1.6TB dataset spanning 59 languages
BLOOM language model trained using ROOTS corpus
Large initial subset of corpus released with processing tools

Paper Content

Introduction

BigScience is a one-year open collaborative research initiative
Goal was to train an open-access, massively multilingual language model
Engaged in ethical, sociopolitical, and data governance issues
Four working groups: Data Governance, Data Sourcing and Preparation, Privacy, Legal Scholarship
Released a large subset of ROOTS
Released data tools used to curate, source, clean and inspect constituent datasets

Outline of the paper

Collected a web-scale dataset covering 59 languages
46 natural languages and 13 programming languages
62% of text from community-selected and documented list of language data sources
38% of text from pre-processed web crawl, OSCAR
Filtered with help of native speakers

Related work

Pre-trained models are used in natural language processing
Performance depends on model size and on dataset size and quality
Recent models trained on up to 1.4 trillion tokens
Datasets are not usually released
Exceptions include the Pile, C4, mC4, CC100, and OSCAR
Tooling, visualization, and replication of datasets are needed
Documentation of corpora is becoming more common

(Crowd)sourcing a language resource catalogue

62% of the dataset was made up of monolingual and multilingual language resources
Metadata was collected through open submissions and hackathons
252 sources were collected, including at least 21 per language category
Additional Arabic language resources were gathered from the Masader repository
Websites were selected to increase geographical diversity of English, Spanish, and Chinese language data
Code data was selected from GitHub and StackExchange to test large language models’ ability to handle computer code

Obtaining data from the identified resources

Leveraged BigScience Catalogue and Masader repository to obtain text from identified sources
Established a two-phase approach to collect data sources and map them to a common format
Organized open hackathon to gather identified sources on Hugging Face Datasets hub
Language segmentation to obtain monolingual datasets
Documents consist of two fields: “text” and “meta”
Websites required a particular effort and dedicated pipeline
Pseudo-crawling used to retrieve pages from Common Crawl
Domain names from two sources: metadata and participants
Collected URLs from Common Crawl index
Extracted text from HTML pages
Collected code dataset from BigQuery
Manually inspected, deduplicated, and made further selection of sources
Removed datasets with high incidence of non-natural language
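
The common format described above, where each document has only a "text" field and a "meta" field, can be illustrated with a small sketch. The helper name `to_common_format` and the keys placed inside "meta" (`source`, `language`, `url`) are assumptions made for this example; the summary does not list the actual metadata keys.

```python
import json
from typing import Optional


def to_common_format(raw_text: str, source_name: str, language: str,
                     extra_meta: Optional[dict] = None) -> dict:
    """Map one raw record to a document with exactly two fields: "text" and "meta".

    The keys stored under "meta" are hypothetical and chosen for illustration.
    """
    meta = {"source": source_name, "language": language}
    if extra_meta:
        meta.update(extra_meta)
    return {"text": raw_text.strip(), "meta": meta}


doc = to_common_format(
    "  Bonjour le monde.  ",
    source_name="example_source",
    language="fr",
    extra_meta={"url": "https://example.org/page"},
)

# One JSON object per line is a common way to store such documents.
line = json.dumps(doc, ensure_ascii=False)
```

Keeping all source-specific information under a single "meta" field means that later processing steps only need to know about "text", whatever the origin of the document.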

Processing pipeline for quality improvement on crowdsourced datasets

Attempted to improve quality of text from HTML
Applied processing pipeline to remove noisy data
Functions categorized as document-scoped or dataset-scoped
Cleaning functions remove text not part of main document
Filtering functions remove entire document from corpus
Built visualization tool to understand impact of each function
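
The distinction between cleaning functions (which edit a document) and filtering functions (which drop it) can be sketched as follows. This is a minimal illustration of the document-scoped case only; the function names, the heuristics, and the thresholds are assumptions for the example and are not taken from the paper.

```python
from typing import Callable, List, Optional


def strip_short_lines(text: str, min_chars: int = 20) -> str:
    """Cleaning function: remove lines too short to be main content (hypothetical heuristic)."""
    kept = [line for line in text.splitlines() if len(line.strip()) >= min_chars]
    return "\n".join(kept)


def has_enough_words(text: str, min_words: int = 5) -> bool:
    """Filtering predicate: True if the document should be kept (hypothetical threshold)."""
    return len(text.split()) >= min_words


def process_document(doc: dict,
                     cleaners: List[Callable[[str], str]],
                     filters: List[Callable[[str], bool]]) -> Optional[dict]:
    """Apply cleaners in order, then filters; return None if any filter rejects the document."""
    text = doc["text"]
    for clean in cleaners:
        text = clean(text)
    for keep in filters:
        if not keep(text):
            return None
    return {"text": text, "meta": doc["meta"]}


docs = [
    {"text": "Home | Login\nThis is the main body of an article with enough words.", "meta": {}},
    {"text": "Menu\nOK", "meta": {}},
]

results = (process_document(d, [strip_short_lines], [has_enough_words]) for d in docs)
processed = [d for d in results if d is not None]
```

In this sketch the first document loses its navigation line but survives, while the second is removed entirely. A dataset-scoped function would instead need to see all documents at once, as deduplication does.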

Processing OSCAR

Chose to complement data with Common Crawl-based data
Quantity of data is a strong factor in model performance
Used OSCAR version 21.09 to make up 38% of final dataset
Crawled data has issues such as machine-generated content and privacy risk
Used quality indicators to filter out specific pages
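
Filtering pages with quality indicators can be sketched as computing a few per-page statistics and comparing them with thresholds. The two indicators below, the tiny stopword list, and the threshold values are illustrative assumptions; the summary does not state which indicators or cut-offs were used.

```python
# Hypothetical, deliberately tiny stopword list used only for this example.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}


def special_char_ratio(text: str) -> float:
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 1.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def stopword_ratio(text: str) -> float:
    """Share of whitespace-separated words that appear in STOPWORDS."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)


def keep_page(text: str, max_special: float = 0.3, min_stopword: float = 0.1) -> bool:
    """Keep a page only if it passes both (assumed) thresholds."""
    return (special_char_ratio(text) <= max_special
            and stopword_ratio(text) >= min_stopword)


pages = [
    "The history of the city is told in the museum.",
    "$$$ >>> ### buy!!! *** $$$",
]
kept_pages = [p for p in pages if keep_page(p)]
```

Because such thresholds depend heavily on the language, the filtering was done with the help of native speakers, as noted in the outline.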

Deduplication

Data deduplication improves performance and decreases memorization of training data
Initially used SimHash to remove near duplicate documents
0.7% of documents identified as near duplicates
Found false positives among long documents
Applied substring deduplication for documents with more than 6000 characters
Found 21.67% of the data (in bytes) to be duplicated
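
The idea behind SimHash-based near-duplicate detection can be sketched in a few lines: each document is reduced to a fixed-size fingerprint, and two documents are treated as near duplicates when their fingerprints differ in only a few bits. The shingle size, the use of MD5 as the underlying hash, and the 64-bit fingerprint width are assumptions for this example, not the settings used for ROOTS.

```python
import hashlib


def simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over word n-grams (shingles)."""
    tokens = text.lower().split()
    n_shingles = max(1, len(tokens) - ngram + 1)
    shingles = [" ".join(tokens[i:i + ngram]) for i in range(n_shingles)]

    # Each shingle votes +1 or -1 on every bit position.
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1

    # The fingerprint keeps the bits with a positive vote total.
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = "The quick brown fox jumps over the lazy dog near the river bank"
doc_b = "The quick brown fox jumps over the lazy dog near the river"
distance = hamming(simhash(doc_a), simhash(doc_b))
```

A pair would be flagged as near duplicates when `distance` falls below a chosen threshold. Substring deduplication works differently: it looks for long exact substrings shared between documents rather than comparing whole-document fingerprints.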

Personally identifiable information

Used rule-based approach with regular expressions
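Redacted instances of KEY (numeric and alphanumeric identifiers), EMAIL (email addresses), USER (social media handles), and IP_ADDRESS (IPv4 or IPv6 addresses)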
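
A rule-based redaction step of this kind can be sketched with two regular expressions. The patterns below are simplified examples written for this illustration, and replacing each match with the bare tag name is an assumption; neither is taken from the paper's actual rules. KEY and USER are omitted to keep the sketch short.

```python
import re

# Simplified, hypothetical patterns; real-world rules need to be far more careful.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def redact(text: str) -> str:
    """Replace email addresses and IPv4 addresses with placeholder tags."""
    text = EMAIL_RE.sub("EMAIL", text)
    text = IPV4_RE.sub("IP_ADDRESS", text)
    return text


redacted = redact("Contact jane.doe@example.org from 192.168.0.1 today.")
```

Regular expressions are cheap to run over a large corpus, but they trade recall and precision against each other: a loose pattern redacts harmless numbers, and a strict one misses unusual formats.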

A first look at ROOTS

1.6 Terabytes of multilingual text
Figure 4 compares the sizes of corpora used to train large language models
Interactive dataset card deck documents individual components of the corpus

Natural languages

Corpus created through crowdsourcing
46 languages from 3 macroareas and 9 language families
English is the largest part of the corpus (30.03%)
Median size of a document is 1,129 bytes
Online interactive exploration tool available for more detailed breakdown

Programming languages

13 programming languages are represented in the code subset of the corpus
Java, PHP, and C++ make up more than half of all documents
A heuristic is used to identify configuration and test files
5.23% of the data consists of configuration files and 7.88% of test files
10.9M duplicate files and 4.1M unique files are found in the clusters
32% of the data consists of near-duplicates
Syntax checkers are used to validate 500K samples of Python and PHP code
1% of the Python data and 2% of the PHP files do not pass the syntax check
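
A syntax check over a sample of Python files can be sketched with the standard library parser. Which checkers were actually used is not stated in this summary, so using `ast.parse` is an assumption; note also that the result depends on the Python version running the check, since code written for an older version may fail to parse.

```python
import ast
from typing import List


def passes_python_syntax_check(source: str) -> bool:
    """Return True if the source parses under the running Python version."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on some Python versions.
        return False


def failure_rate(samples: List[str]) -> float:
    """Fraction of samples that do not pass the syntax check."""
    if not samples:
        return 0.0
    failures = sum(not passes_python_syntax_check(s) for s in samples)
    return failures / len(samples)


samples = [
    "x = 1\n",
    "def f(:\n    pass\n",          # invalid: malformed parameter list
    "print('ok')\n",
    "for i in range(3):\n    pass\n",
]
rate = failure_rate(samples)
```

Parsing only validates syntax; a file that passes may still fail at import or run time.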

Tokenizer analysis of the component datasets

A tokenizer trained on a dataset can be used to measure the number of tokens produced per byte of natural language text.
Components with outlier values, such as incorrectly classified languages or crawling errors, can be spotted this way.
Tokenizers trained on different corpora can be compared to see how they differ.
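
The tokens-per-byte measurement can be sketched as below. A whitespace tokenizer stands in for a trained subword tokenizer so that the example has no dependencies; the component names, the sample texts, and the "factor of the median" outlier rule are all assumptions made for illustration.

```python
import statistics
from typing import Callable, Dict, List


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer; a real analysis would plug in a trained tokenizer here."""
    return text.split()


def tokens_per_byte(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Total number of tokens divided by total UTF-8 byte length."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    return total_tokens / total_bytes if total_bytes else 0.0


def flag_outliers(ratios: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Flag components whose ratio is more than `factor` times above or below the median."""
    median = statistics.median(ratios.values())
    return [name for name, r in ratios.items()
            if r > factor * median or r < median / factor]


# Hypothetical components of a corpus, each with a tiny text sample.
components = {
    "english_news": ["The council met on Monday to discuss the new budget."],
    "english_wiki": ["The river flows north through the valley to the sea."],
    "crawl_error": ["a b c d e f g h i j k l m n o p"],
}
ratios = {name: tokens_per_byte(texts, whitespace_tokenize)
          for name, texts in components.items()}
flagged = flag_outliers(ratios)
```

A component whose ratio is far from that of its peers in the same language is worth inspecting by hand, since the cause may be a wrong language label or broken extraction rather than a genuine property of the text.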

Conclusion

ROOTS is a massive multilingual corpus created by an international collaboration of researchers
Data-first approach was used to train the BLOOM model
Tooling developed throughout the project is released
BigScience Research Workshop was conceived as a collaborative and value-driven endeavor
Core values chosen by the collaborators contributing to data efforts were articulated
Values include openness, reproducibility, responsibility, diversity, and inclusivity
Participatory approaches were used to bridge gaps between model development and deployment
Framework was developed to uphold rights and responsibilities of stakeholders
Use of data from Common Crawl is a point of tension between the drive to present a research artifact and the values of consent and privacy
Limitations include the use of data from Common Crawl and reliance on medium to large sources of digitized content
Tools used to obtain the crowdsourced dataset include pseudocode to recreate text structure from HTML code, a visualisation tool, and the functions used in the processing pipeline
