Abstract

  • ROOTS is a 1.6TB multilingual text corpus.
  • It was developed to train BLOOM, described as the largest language model explicitly accompanied by commensurate data governance efforts.
  • ROOTS Search Tool is a search engine for the entire ROOTS corpus.
  • It offers fuzzy and exact search capabilities.
  • ROOTS is the largest corpus to date that can be investigated this way.
  • The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces.

Paper Content

Introduction

  • LLMs are used in NLP
  • Demand for training data is increasing
  • Quality and source of data are a concern
  • Need to characterize data to understand model performance
  • Researchers exploring ways to describe large datasets
  • User-friendly tools for qualitative analysis are missing
  • Presenting ROOTS Search Tool for 1.6TB multilingual ROOTS corpus
  • Tool facilitates qualitative analysis of web-scale corpus
  • Qualitative analysis of training data is essential for model understanding and governance
  • Corpus linguistics is an area of research that studies large volumes of text
  • The British National Corpus is an example of a text collection that was created to represent British English
  • Corpus linguistics developed sophisticated methodologies for studying text
  • As LLMs grew, so did the need for massive pretraining datasets
  • These datasets can contain synthetic data, privacy-infringing data, incorrect language codes, and translations
  • Information Retrieval is another field concerned with searching large data collections
  • There have been few efforts to apply Information Retrieval to study LLM training data
  • This paper is the first principled effort to provide search access to the training corpus of an existing large language model

The ROOTS corpus

  • ROOTS corpus is a high-quality, multilingual text corpus
  • ROOTS consists of 1.6TB of data in 46 natural and 13 programming languages

Data governance

  • BLOOM model developed within BigScience project
  • Data governance identified as high-impact lever of action
  • Framework designed to meet needs of distributed data governance
  • Partial implementation used for ROOTS data
  • Tool enables examination and feedback for data sources
  • Tool provides 128-word snippets of indexed documents
  • Users able to flag specific search results with explanation

Data pre-processing

  • Documents in ROOTS vary in length, with some as long as 282,571 words.
  • Documents are split into snippets of at most 128 words for fuzzy search.
  • Unique Result IDs are created to trace search results back to their source.
  • PII redaction script is applied to OSCAR prior to BLOOM training.
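
The snippet splitting and result IDs described above can be sketched as follows. This is a minimal illustration, not the tool's code: whitespace tokenization, the function name `split_into_snippets`, and the ID scheme `<dataset>/<doc_index>/<snippet_index>` are all assumptions made for the example.

```python
from typing import Dict, Iterator

SNIPPET_WORDS = 128  # maximum snippet length stated in the summary


def split_into_snippets(dataset: str, doc_index: int, text: str,
                        max_words: int = SNIPPET_WORDS) -> Iterator[Dict[str, str]]:
    """Yield snippets of at most `max_words` whitespace-separated words.

    The ID format is a hypothetical scheme: it only shows how a result
    could be traced back to its dataset, document, and position.
    """
    words = text.split()
    for snippet_index, start in enumerate(range(0, len(words), max_words)):
        yield {
            "id": f"{dataset}/{doc_index}/{snippet_index}",
            "text": " ".join(words[start:start + max_words]),
        }


# A 300-word document yields snippets of 128, 128, and 44 words.
snippets = list(split_into_snippets("example_dataset", 7, "word " * 300))
```

Because each ID encodes the source dataset and document, a flagged search result can be mapped back to where it came from without storing the full document alongside the snippet.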

Implementation

  • ROOTS corpus is organized in 498 datasets
  • Each dataset is annotated with a language identifier
  • Two types of identifiers: individual language and language within a language group
  • 13 sparse, BM25 indices built
  • Exact search backend leverages a suffix array implementation
  • User interface built with Gradio and served via Hugging Face Spaces
  • Fuzzy searches can be performed in a user-specified language or all languages
  • Option to auto-detect language of query with FastText classifier
  • Results displayed in order of decreasing relevance
  • PII redaction applied to all results
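
To make the two search modes concrete, here is a toy, in-memory sketch of BM25 ranking (fuzzy search) and suffix-array lookup (exact search). It is a stand-in for illustration only and does not reflect the tool's actual backend; the function names, the parameter defaults `k1=1.2` and `b=0.75`, and the quadratic suffix-array construction are assumptions suitable for short inputs.

```python
import math
from collections import Counter
from typing import List, Tuple


def bm25_rank(query: str, docs: List[str],
              k1: float = 1.2, b: float = 0.75) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs in order of decreasing BM25 score."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def build_suffix_array(text: str) -> List[int]:
    """Toy construction: sort suffix start offsets lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_search(text: str, suffix_array: List[int], pattern: str) -> List[int]:
    """Return sorted start offsets of every exact occurrence of `pattern`."""
    lo, hi = 0, len(suffix_array)
    # Binary search for the first suffix whose prefix is >= pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes are contiguous in the sorted order.
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


docs = ["the cat sat on the mat", "dogs chase the cat",
        "a corpus of multilingual text"]
ranking = [i for i, _ in bm25_rank("cat", docs)]
offsets = exact_search("banana", build_suffix_array("banana"), "ana")
```

BM25 rewards term matches while normalizing for document length, which suits ranked keyword queries; the suffix array instead answers whether a literal string occurs anywhere in the text, and at which offsets.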

Use cases

  • Detecting and obfuscating PII in documents
  • Tool allows searching for specific PII
  • Detecting problematic content
  • Studying representation of dialects and social groups
  • Detecting presence of specific information
  • Detecting plagiarism/memorization
  • Verifying originality
  • Non-existing facts
  • Enabling data removal requests
  • Data contamination
  • Language contamination
  • Word sense disambiguation
  • Pre-processing issues
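
As an illustration of the PII obfuscation use case, a regular-expression-based redaction pass might look like the sketch below. The patterns, the placeholder tokens `[EMAIL]` and `[IP_ADDRESS]`, and the function name `redact_pii` are assumptions for this example; they are not the patterns or tokens of the script used for ROOTS.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and IPv4-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP_ADDRESS]", text)


redacted = redact_pii("Contact jane.doe@example.org from 192.168.0.1.")
```

Applying redaction to search results at display time, rather than only to the indexed text, means the same filter also covers sources that were not redacted before training.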

Limitations and future work

  • Limitations of the work include only providing short snippets of indexed texts and issues with exact and fuzzy search.
  • Tool is heavily influenced by UX of search engines and has similar core functionality.
  • Future versions will draw on classic corpus analysis tools for ideas about alternative presentation modes.
  • Future versions will also add more quantitative information, such as term frequency, number of hits, and co-occurrence statistics.
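
The quantitative information mentioned above could be computed along the following lines. This is a sketch under stated assumptions: whitespace tokenization, snippet-level co-occurrence, and the function name `corpus_stats` are illustrative choices, not features of the tool.

```python
from collections import Counter
from typing import List, Tuple


def corpus_stats(snippets: List[str], term: str) -> Tuple[int, int, Counter]:
    """Return (term frequency, number of hits, co-occurrence counts).

    - term frequency: total occurrences of `term` across all snippets
    - hits: number of snippets containing `term` at least once
    - co-occurrence: for each other word, the number of snippets in which
      it appears together with `term`
    """
    term = term.lower()
    tokenized = [s.lower().split() for s in snippets]
    term_frequency = sum(tokens.count(term) for tokens in tokenized)
    hits = sum(1 for tokens in tokenized if term in tokens)
    cooccurrence: Counter = Counter()
    for tokens in tokenized:
        if term in tokens:
            cooccurrence.update(set(tokens) - {term})
    return term_frequency, hits, cooccurrence


tf, hits, cooc = corpus_stats(
    ["open data open models", "open science", "closed models"], "open")
```

Here term frequency counts every occurrence, while hits counts each snippet once, so the two diverge whenever a term repeats within a snippet.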