Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape

dc.contributor.advisor: Oh, Sewoong
dc.contributor.advisor: Schmidt, Ludwig
dc.contributor.author: Nguyen, Thao
dc.date.accessioned: 2026-04-20T15:27:08Z
dc.date.available: 2026-04-20T15:27:08Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026
dc.description: Thesis (Ph.D.)--University of Washington, 2026
dc.description.abstract: Foundation models have transformed the machine learning landscape with unprecedented generalization capabilities across a variety of tasks. Central to their success is the data on which they are trained, which has grown massively in scale through large web crawls and data generation efforts. Despite growing awareness of the need for data curation, current data practices remain largely heuristic and coupled to specific model and training configurations, making it difficult to isolate data-centric contributions. In this thesis, I present my work towards developing systematic, generalizable, and timely methods to optimize dataset design for foundation models. In the first work, I provided one of the earliest empirical demonstrations that indiscriminately mixing different web data sources undermines model generalization, establishing data quality as a foundational principle for large-scale curation. As the field embraced data quality and proposed increasingly aggressive filtering pipelines, I found that these methods tend to overfit to existing benchmarks and systematically exclude valuable data, such as non-English content, that can improve overall model performance. My subsequent work thus argued that diversity in representation should be a deliberate design decision in the curation process rather than a mere byproduct of it. Next, moving beyond filtering as the primary curation tool, I proposed image recaptioning as a way to transform low-quality image-text pairs into useful training data. Rather than asking what data to discard, this research asked what discarded data can be recovered. In the last work covered by this thesis, I extended this philosophy to the text domain. I addressed the growing scarcity of high-quality web text by offering a sustainable approach to recycling discarded documents, effectively doubling the yield of useful pretraining tokens.
Collectively, my research contributes to establishing data curation as a scientific discipline, one that is systematic, adaptive, and central to the future of foundation model development.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Nguyen_washington_0250E_29348.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55475
dc.language.iso: en_US
dc.rights: none
dc.subject: data curation
dc.subject: data filtering
dc.subject: foundation models
dc.subject: language models
dc.subject: multimodal models
dc.subject: pretraining
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape
dc.type: Thesis

Files

Original bundle

Name: Nguyen_washington_0250E_29348.pdf
Size: 13.56 MB
Format: Adobe Portable Document Format