Navigating the Ocean of Language Model Training Data

dc.contributor.advisor: Choi, Yejin
dc.contributor.advisor: Hajishirzi, Hannaneh
dc.contributor.author: Liu, Jiacheng
dc.date.accessioned: 2026-02-05T19:34:19Z
dc.date.available: 2026-02-05T19:34:19Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: One crucial step toward understanding large language models (LLMs) is to understand their training data. Modern LLMs are trained on text corpora containing trillions of tokens, which makes these corpora difficult to analyze. In this thesis, I discuss my research on making these massive text corpora efficiently searchable and on revealing insights into the connection between LLMs and their training data. First, I developed infini-gram, a search engine system that enables fast string counting and document retrieval. With infini-gram, I indexed four open text corpora commonly used for LLM pretraining, totaling 5 trillion tokens. A by-product was the largest n-gram language model ever built as of the date of publication, which I combined with neural LLMs to greatly improve their perplexity. Next, on top of infini-gram, I led the development of OLMoTrace, a system for tracing LLM generations back to their multi-trillion-token training data in real time. OLMoTrace surfaces long verbatim matches between LLM outputs and the full training data, enabling fact-checking, tracing of "creative expressions", understanding of LLMs' math capabilities, and much more. Finally, to enable searching even bigger, Internet-scale corpora on a limited budget, more storage-efficient indexing techniques are needed. To that end, we developed infini-gram mini, a search system requiring 12x less storage than the original infini-gram, conceptually allowing us to index the entirety of Common Crawl (the main source of training data for LLMs). We indexed 83TB of text, including the Common Crawl snapshots between January and July 2025, making it the largest body of searchable text in the open-source community. With infini-gram mini, we revealed that many crucial LLM evaluation benchmarks are heavily contaminated, and we are hosting a public bulletin to continuously monitor this dire evaluation crisis.
Together, my research enables everyone to inspect and understand LLM training data at scale, and paves the way toward comprehending and debugging LLM behaviors from a data perspective.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Liu_washington_0250E_29060.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55192
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Navigating the Ocean of Language Model Training Data
dc.type: Thesis

Files

Original bundle

Name: Liu_washington_0250E_29060.pdf
Size: 15.59 MB
Format: Adobe Portable Document Format