Systematic Explorations for Data-Efficient LLM Training

dc.contributor.advisor: Schmidt, Ludwig
dc.contributor.advisor: Ratner, Alexander
dc.contributor.author: Li, Jeffrey
dc.date.accessioned: 2026-04-20T15:27:02Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026
dc.description: Thesis (Ph.D.)--University of Washington, 2026
dc.description.abstract: Large language models (LLMs) have demonstrated remarkable performance across many downstream domains. When training these models, the first phase of pretraining involves learning to predict the next token across massive amounts of web-scale text data. While the ever-increasing scale of pretraining gives models a crucial foundation of knowledge and skills, it also comes with extremely high costs, constraining both achievable performance (within practical compute budgets) and amenability to scientific study. In this dissertation, we discuss our work along two key directions for more data-efficient LLM pretraining. First, we study data curation, tackling the problem of how to best process and filter Internet data into usable training datasets. Second, we consider the question of how to best continually update models on new data as the world evolves over time. For both directions, we propose novel benchmark setups to systematically explore different strategies and highlight key challenges, ultimately resulting in interventions that offer significant efficiency gains. Beyond pretraining, we also examine the labeled data bottleneck when fine-tuning models for specific tasks. We investigate the intersection of programmatic weak supervision and semi-supervised learning, clarifying when and how the latter can further improve the labeling efficiency of the former. Collectively, these works aim to contribute to the broader effort of making LLM training a more effective and scientifically rigorous practice.
dc.embargo.lift: 2027-04-20T15:27:02Z
dc.embargo.terms: Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Li_washington_0250E_29199.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55470
dc.language.iso: en_US
dc.rights: none
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Systematic Explorations for Data-Efficient LLM Training
dc.type: Thesis

Files

Original bundle

Name: Li_washington_0250E_29199.pdf
Size: 9.11 MB
Format: Adobe Portable Document Format