Navigating the Ocean of Language Model Training Data

dc.contributor.advisor: Choi, Yejin
dc.contributor.advisor: Hajishirzi, Hannaneh
dc.contributor.author: Liu, Jiacheng
dc.date.accessioned: 2026-02-05T19:34:19Z
dc.date.available: 2026-02-05T19:34:19Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: One crucial step toward understanding large language models (LLMs) is to understand their training data. Modern LLMs are trained on text corpora containing trillions of tokens, which makes these corpora difficult to analyze. In this thesis, I discuss my research on making these massive text corpora efficiently searchable and on revealing insights into the connection between LLMs and their training data. First, I developed infini-gram, a search engine system that enables fast string counting and document retrieval. With infini-gram, I indexed four open text corpora commonly used for LLM pretraining, totaling 5 trillion tokens. A by-product was the largest n-gram language model ever built as of the date of publication, which I combined with neural LLMs to greatly improve their perplexity. Next, on top of infini-gram, I led the development of OLMoTrace, a system for tracing LLM generations back to their multi-trillion-token training data in real time. OLMoTrace surfaces long verbatim matches between LLM outputs and the full training data, enabling fact-checking, tracing of "creative expressions", understanding of LLMs' math capabilities, and much more. Finally, to enable searching even bigger, Internet-scale corpora on a limited budget, more storage-efficient indexing techniques are needed. To that end, we developed infini-gram mini, a search system requiring 12x less storage than the original infini-gram, conceptually allowing us to index the entirety of Common Crawl (the main source of training data for LLMs). We indexed 83TB of text, including the Common Crawl snapshots between January and July 2025, making it the largest body of searchable text in the open-source community. With infini-gram mini, we revealed that many crucial LLM evaluation benchmarks are heavily contaminated, and we are hosting a public bulletin to continuously monitor this dire evaluation crisis.
Together, my research enables everyone to inspect and understand LLM training data at scale, and paves the way toward comprehending and debugging LLM behaviors from a data perspective.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Liu_washington_0250E_29060.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55192
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Navigating the Ocean of Language Model Training Data
dc.type: Thesis

Files

Original bundle

Name: Liu_washington_0250E_29060.pdf
Size: 15.59 MB
Format: Adobe Portable Document Format