Systematic Explorations for Data-Efficient LLM Training

dc.contributor.advisor: Schmidt, Ludwig
dc.contributor.advisor: Ratner, Alexander
dc.contributor.author: Li, Jeffrey
dc.date.accessioned: 2026-04-20T15:27:02Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026
dc.description: Thesis (Ph.D.)--University of Washington, 2026
dc.description.abstract: Large language models (LLMs) have demonstrated remarkable performance across many downstream domains. When training these models, the first phase of pretraining involves learning to predict the next token across massive amounts of web-scale text data. While the ever-increasing scale of pretraining gives models a crucial foundation of knowledge and skills, it also comes with extremely high costs, constraining both achievable performance (within practical compute budgets) and amenability to scientific study. In this dissertation, we discuss our work along two key directions for more data-efficient LLM pretraining. First, we study data curation, tackling the problem of how to best process and filter Internet data into usable training datasets. Second, we consider the question of how to best continually update models on new data as the world evolves over time. For both directions, we propose novel benchmark setups to systematically explore different strategies and highlight key challenges, ultimately resulting in interventions that offer significant efficiency gains. Beyond pretraining, we also examine the labeled data bottleneck when fine-tuning models for specific tasks. We investigate the intersection of programmatic weak supervision and semi-supervised learning, clarifying when and how the latter can further improve the labeling efficiency of the former. Collectively, these works aim to contribute to the broader effort of making LLM training a more effective and scientifically rigorous practice.
dc.embargo.lift: 2027-04-20T15:27:02Z
dc.embargo.terms: Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Li_washington_0250E_29199.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55470
dc.language.iso: en_US
dc.rights: none
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Systematic Explorations for Data-Efficient LLM Training
dc.type: Thesis

Files

Original bundle

Name: Li_washington_0250E_29199.pdf
Size: 9.11 MB
Format: Adobe Portable Document Format