Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ROOTS is a 1.6TB multilingual text corpus.
- It is used to train BLOOM, the largest language model.
- ROOTS Search Tool is a search engine for the entire ROOTS corpus.
- It offers fuzzy and exact search capabilities.
- ROOTS is the largest corpus to date that can be investigated this way.
- The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces.
Paper Content
Introduction
- LLMs are used in NLP
- Demand for training data is increasing
- Quality and source of data is a concern
- Need to characterize data to understand model performance
- Researchers exploring ways to describe large datasets
- User-friendly tools for qualitative analysis are missing
- Presenting ROOTS Search Tool for 1.6TB multilingual ROOTS corpus
- Tool facilitates qualitative analysis of web-scale corpus
- Qualitative analysis of training data is essential for model understanding and governance
Related work
- Corpus linguistics is an area of research that studies large volumes of text
- The British National Corpus is an example of a text collection that was created to represent British English
- Corpus linguistics developed sophisticated methodologies for studying text
- As LLMs grew, so did the need for massive pretraining datasets
- These datasets can contain synthetic data, privacy-infringing data, incorrect language codes, and translations
- Information Retrieval is another Machine Learning domain that inspects large data collections
- There have been few efforts to apply Information Retrieval to study LLM training data
- This paper is the first principled effort to provide search access to the training corpus of an existing large language model
The roots corpus
- ROOTS corpus is a high-quality, multilingual text corpus
- ROOTS consists of 1.6TB of data in 46 natural and 13 programming languages
Data governance
- BLOOM model developed within BigScience project
- Data governance identified as high-impact lever of action
- Framework designed to meet needs of distributed data governance
- Partial implementation used for ROOTS data
- Tool enables examination and feedback for data sources
- Tool provides 128-word snippets of indexed documents
- Users able to flag specific search results with explanation
Data pre-processing
- Documents in ROOTS vary in length, with some as long as 282,571 words.
- Documents are split into snippets of at most 128 words for fuzzy search.
- Unique Result IDs are created to trace search results back to their source.
- PII redaction script is applied to OSCAR prior to BLOOM training.
Implementation
- ROOTS corpus is organized in 498 datasets
- Each dataset is annotated with a language identifier
- Two types of identifiers: individual language and language within a language group
- 13 sparse, BM25 indices built
- Exact search backend leverages a suffix array implementation
- User interface built with Gradio and served via Hugging Face Spaces
- Fuzzy searches can be performed in a user-specified language or all languages
- Option to auto-detect language of query with FastText classifier
- Results displayed in order of decreasing relevance
- PII redaction applied to all results
Use cases
- Detecting and obfuscating PII in documents
- Tool allows searching for specific PII
- Detecting problematic content
- Studying representation of dialects and social groups
- Detecting presence of specific information
- Detecting plagiarism/memorization
- Verifying originality
- Non-existing facts
- Enabling data removal requests
- Data contamination
- Language contamination
- Word sense disambiguation
- Pre-processing issues
Limitations and future work
- Limitations of the work include only providing short snippets of indexed texts and issues with exact and fuzzy search.
- Tool is heavily influenced by UX of search engines and has similar core functionality.
- Future versions will review classic corpus analysis tools for ideas of different presentation modes.
- Will add more quantitative information, such as term frequency, number of hits, and co-occurrence statistics.