Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Need for large-scale high-quality text datasets
- BigScience workshop formed to research and train large language models
- ROOTS corpus created, 1.6TB dataset spanning 59 languages
- BLOOM language model trained using ROOTS corpus
- Large initial subset of corpus released with processing tools
Paper Content
Introduction
- BigScience1 is a one-year open collaborative research initiative
- Goal was to train an open-access, massively multilingual language model
- Engaged in ethical, sociopolitical, and data governance issues
- Four working groups: Data Governance, Data Sourcing and Preparation, Privacy, Legal Scholarship
- Released a large subset of ROOTS
- Released data tools used to curate, source, clean and inspect constituent datasets
Outline of the paper
- Collected a web-scale dataset covering 59 languages
- 46 natural languages and 13 programming languages
- 62% of text from community-selected and documented list of language data sources
- 38% of text from pre-processed web crawl, OSCAR
- Filtered with help of native speakers
Related work
- Pre-trained models are used in natural language processing
- Performance is based on model size and dataset size/quality
- Recent models trained on up to 1.4 trillion tokens
- Datasets are not usually released
- Exceptions include Pile, C4, mC4, CC100, and OSCAR
- Tooling, visualization, and replication of datasets is needed
- Documentation of corpora is becoming more common
(crowd) sourcing a language resource catalogue
- 62% of the dataset was made up of monolingual and multilingual language resources
- Metadata was collected through open submissions and hackathons
- 252 sources were collected, including at least 21 per language category
- Additional Arabic language resources were gathered from the Masader repository
- Websites were selected to increase geographical diversity of English, Spanish, and Chinese language data
- Code data was selected from GitHub and StackExchange to test large language models’ ability to handle computer code
Obtaining data from the identified resources
- Leveraged BigScience Catalogue and Masader repository to obtain text from identified sources
- Established 2-phase approach to collect data sources and map them to a common format
- Organized open hackathon to gather identified sources on Hugging Face Datasets hub
- Language segmentation to obtain monolingual datasets
- Documents consist of two fields: “text” and “meta”
- Websites required a particular effort and dedicated pipeline
- Pseudo-crawling used to retrieve pages from Common Crawl
- Domain names from two sources: metadata and participants
- Collected URLs from Common Crawl index
- Extracted text from HTML pages
- Collected code dataset from BigQuery
- Manually inspected, deduplicated, and made further selection of sources
- Removed datasets with high incidence of non-natural language
Processing pipeline for quality improvement on crowdsourced datasets
- Attempted to improve quality of text from HTML
- Applied processing pipeline to remove noisy data
- Functions categorized as document-scoped or dataset-scoped
- Cleaning functions remove text not part of main document
- Filtering functions remove entire document from corpus
- Built visualization tool to understand impact of each function
Processing oscar
- Chose to complement data with Common Crawl-based data
- Quantity of data is a strong factor in model performance
- Used OSCAR version 21.09 to make up 38% of final dataset
- Crawled data has issues such as machine-generated content and privacy risk
- Used quality indicators to filter out specific pages
Deduplication
- Data deduplication improves performance and decreases memorization of training data
- Initially used SimHash to remove near duplicate documents
- 0.7% of documents identified as near duplicates
- False positives among long documents
- Applied substring deduplication for documents with more than 6000 characters
- 21.67% of data (in bytes) being duplicated
Personally identifiable information
- Used rule-based approach with regular expressions
- Redacted instances of KEY, EMAIL, USER, and IP_ADDRESS
A first look at roots
- 1.6 Terabytes of multilingual text
- Figure 4 compares the sizes of corpora used to train large language models
- Interactive dataset card deck documents individual components of the corpus
Natural languages
- Corpus created through crowdsourcing
- 46 languages from 3 macroareas and 9 language families
- English is the largest part of the corpus (30.03%)
- Median size of a document is 1,129 bytes
- Online interactive exploration tool available for more detailed breakdown
Programming languages
- 13 programming languages are represented in the code subset of the corpus
- Java, PHP, and C++ make up more than half of all documents
- A heuristic is used to identify configuration and test files
- 5.23% of the data consists of configuration files and 7.88% of test files
- 10.9M duplicate files and 4.1M unique files are found in the clusters
- 32% of the data consists of near-duplicates
- Syntax checkers are used to validate 500K samples of Python and PHP code
- 1% of the Python data and 2% of the PHP files do not pass the syntax check
Tokenizer analysis of the component datasets
- A tokenizer trained on a dataset can be used to measure the number of tokens produced for a byte of natural language.
- Outlier values, such as incorrectly classified languages or crawling errors, can be spotted using tokenizers.
- Tokenizers trained on different corpora can be compared to see how they differ.
- Tokenizers can be used to identify outlier components in a dataset.
Conclusion
- ROOTS is a massive multilingual corpus created by an international collaboration of researchers
- Data-first approach was used to train the BLOOM model
- Tooling developed throughout the project is released
- BigScience Research Workshop was conceived as a collaborative and value-driven endeavor
- Core values chosen by the collaborators contributing to data efforts were articulated
- Values include openness, reproducibility, responsibility, diversity, and inclusivity
- Participatory approaches were used to bridge gaps between model development and deployment
- Framework was developed to uphold rights and responsibilities of stakeholders
- Use of data from Common Crawl is a point of tension between drive to present research artifact and values of consent and privacy
- Limitations include use of data from Common Crawl and reliance on medium to large sources of digitized content
- Tools used to obtain crowdsourced dataset include pseudocode to recreate text structure from HTML code, visualisation tool, and functions used in processing pipeline