Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs are used in NLP
Demand for training data is increasing
Quality and source of data are a concern
Need to characterize data to understand model performance
Researchers exploring ways to describe large datasets
User-friendly tools for qualitative analysis are missing
The paper presents the ROOTS Search Tool for the 1.6TB multilingual ROOTS corpus
Tool facilitates qualitative analysis of web-scale corpus
Qualitative analysis of training data is essential for model understanding and governance

Corpus linguistics is an area of research that studies large volumes of text
The British National Corpus is an example of a text collection that was created to represent British English
Corpus linguistics developed sophisticated methodologies for studying text
As LLMs grew, so did the need for massive pretraining datasets
These datasets can contain synthetic data, privacy-infringing data, incorrect language codes, and translations
Information Retrieval is another Machine Learning domain that inspects large data collections
There have been few efforts to apply Information Retrieval to study LLM training data
This paper is the first principled effort to provide search access to the training corpus of an existing large language model

The ROOTS corpus is organized into 498 datasets
Each dataset is annotated with a language identifier
Two types of identifiers: individual language and language within a language group
13 sparse BM25 indices were built
Exact search backend leverages a suffix array implementation
User interface built with Gradio and served via Hugging Face Spaces
Fuzzy searches can be performed in a user-specified language or all languages
Option to auto-detect language of query with FastText classifier
Results displayed in order of decreasing relevance
PII redaction applied to all results
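
To make the difference between the two search modes more concrete, here is a minimal sketch of BM25 ranking (the fuzzy search) and of exact lookup over a suffix array. It is a toy illustration only, not the tool's actual backend: the function names, the whitespace tokenization, and the k1 and b values are assumptions made for this example, and a web-scale index would need a far more efficient suffix array construction than the naive sort used here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_search(text, suffix_array, pattern):
    """Return the sorted offsets of every exact occurrence of pattern in text."""
    if not pattern:
        return []
    m = len(pattern)
    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


docs = ["the cat sat on the mat", "dogs chase the cat", "stock markets fell today"]
scores = bm25_scores("cat mat", docs)
# Results in order of decreasing relevance.
print(sorted(range(len(docs)), key=lambda i: -scores[i]))  # [0, 1, 2]

text = "banana"
suffix_array = build_suffix_array(text)
print(exact_search(text, suffix_array, "ana"))  # [1, 3]
```

BM25 returns a graded score, so partially matching documents still show up lower in the ranking, while the suffix array lookup returns only positions where the query string occurs verbatim.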

Limitations of the work: the tool shows only short snippets of the indexed texts, and both exact and fuzzy search have known issues.
The tool is heavily influenced by the UX of search engines and has similar core functionality.
Future versions will review classic corpus analysis tools for ideas on different presentation modes.
Future versions will also add more quantitative information, such as term frequency, number of hits, and co-occurrence statistics.
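
As a rough idea of what such quantitative information could look like, here is a small sketch that computes the three statistics named above over a handful of text snippets. It is a hypothetical illustration and not part of the tool: the function name corpus_statistics, the whitespace tokenization, and the choice to count co-occurrence once per snippet are all assumptions.

```python
from collections import Counter


def corpus_statistics(snippets, term):
    """Return (term frequency, number of hits, co-occurrence counts) for term.

    Term frequency counts every occurrence; number of hits counts snippets
    containing the term; co-occurrence counts, for each other word, how many
    of those snippets it appears in.
    """
    term = term.lower()
    tokenized = [snippet.lower().split() for snippet in snippets]
    term_frequency = sum(tokens.count(term) for tokens in tokenized)
    hits = [tokens for tokens in tokenized if term in tokens]
    cooccurrence = Counter()
    for tokens in hits:
        # Count each co-occurring word once per snippet.
        cooccurrence.update(set(tokens) - {term})
    return term_frequency, len(hits), cooccurrence


snippets = ["open data is open", "open source tools", "closed data"]
term_frequency, num_hits, cooccurrence = corpus_statistics(snippets, "open")
print(term_frequency, num_hits)  # 3 2
print(cooccurrence["data"])      # 1
```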