Navigating the Ocean of Language Model Training Data


Abstract

One crucial step toward understanding large language models (LLMs) is to understand their training data. Modern LLMs are trained on text corpora with trillions of tokens, making these corpora difficult to analyze. In this thesis, I discuss my research on making these massive text corpora efficiently searchable and on revealing insights into the connection between LLMs and their training data. First, I developed infini-gram, a search engine system that enables fast string counting and document retrieval. With infini-gram, I indexed four open text corpora commonly used for LLM pretraining, totaling 5 trillion tokens. A by-product was the largest n-gram language model built as of the date of publication, which I combined with neural LLMs to greatly improve their perplexity. Next, on top of infini-gram, I led the development of OLMoTrace, a system for tracing LLM generations back to their multi-trillion-token training data in real time. OLMoTrace surfaces long verbatim matches between LLM outputs and the full training data, enabling fact-checking, tracing of "creative expressions", understanding of LLMs' math capabilities, and much more. Finally, searching even bigger, Internet-scale corpora on a limited budget requires more storage-efficient indexing techniques. To that end, we developed infini-gram mini, a search system with a 12x lower storage requirement than the original infini-gram, conceptually allowing us to index the entirety of Common Crawl (the main source of training data for LLMs). We indexed 83TB of text, including the Common Crawl snapshots between January and July 2025, making it the largest body of searchable text in the open-source community. With infini-gram mini, we revealed that many crucial LLM evaluation benchmarks are heavily contaminated, and we are hosting a public bulletin to continuously monitor this dire evaluation crisis.
Together, my research enables everyone to inspect and understand LLM training data at scale, and paves the way toward comprehending and debugging LLM behaviors from a data perspective.

Description

Thesis (Ph.D.)--University of Washington, 2025
