Systematic Explorations for Data-Efficient LLM Training
Abstract
Large language models (LLMs) have demonstrated remarkable performance across many downstream domains. The first phase of training these models, pretraining, involves learning to predict the next token across massive amounts of web-scale text data. While the ever-increasing scale of pretraining gives models a crucial foundation of knowledge and skills, it also comes with extremely high costs, constraining both achievable performance (within practical compute budgets) and amenability to scientific study. In this dissertation, we discuss our work along two key directions for more data-efficient LLM pretraining. First, we study data curation, tackling the problem of how to best process and filter Internet data into usable training datasets. Second, we consider how to best continually update models on new data as the world evolves over time. For both directions, we propose novel benchmark setups to systematically explore different strategies and highlight key challenges, ultimately resulting in interventions that offer significant efficiency gains. Beyond pretraining, we also examine the labeled-data bottleneck when fine-tuning models for specific tasks. We investigate the intersection of programmatic weak supervision and semi-supervised learning, clarifying when and how the latter can further improve the labeling efficiency of the former. Collectively, these works aim to contribute to the broader effort of making LLM training a more effective and scientifically rigorous practice.
Description
Thesis (Ph.D.)--University of Washington, 2026
