Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape

Nguyen, Thao

dc.contributor.advisor	Oh, Sewoong
dc.contributor.advisor	Schmidt, Ludwig
dc.contributor.author	Nguyen, Thao
dc.date.accessioned	2026-04-20T15:27:08Z
dc.date.available	2026-04-20T15:27:08Z
dc.date.issued	2026-04-20
dc.date.submitted	2026
dc.description	Thesis (Ph.D.)--University of Washington, 2026
dc.description.abstract	Foundation models have transformed the machine learning landscape with unprecedented generalization capabilities across a variety of tasks. Central to their success is the data on which they are trained, which has grown massively in scale through large web crawls and data generation efforts. Despite growing awareness of the need for data curation, current data practices remain largely heuristic and coupled with specific model and training configurations, making it difficult to isolate data-centric contributions. In this thesis, I present my work towards developing systematic, generalizable, and timely methods to optimize dataset design for foundation models. In the first work, I provided one of the earliest empirical demonstrations that indiscriminately mixing different web data sources undermines model generalization, establishing data quality as a foundational principle for large-scale curation. As the field embraced data quality and proposed increasingly aggressive filtering pipelines, I found that these methods tend to overfit to existing benchmarks and systematically exclude valuable data, such as non-English content, which can improve model performance as a whole. My subsequent work thus argues that diversity in representation should be a deliberate design decision in the curation process, instead of existing only as a byproduct. Next, moving beyond filtering as the primary curation tool, I proposed image recaptioning as a way to transform low-quality image-text pairs into useful training data. Rather than asking what data to discard, my research instead asked what discarded data can be recovered. In the last work covered by this thesis, I extended this philosophy to the text domain. I addressed the growing scarcity of high-quality web texts by offering a sustainable approach to recycle discarded documents, effectively doubling the yield of useful pretraining tokens. 
Collectively, my research contributes to establishing data curation as a scientific discipline, one that is systematic, adaptive, and central to the future of foundation model development.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Nguyen_washington_0250E_29348.pdf
dc.identifier.uri	https://hdl.handle.net/1773/55475
dc.language.iso	en_US
dc.rights	none
dc.subject	data curation
dc.subject	data filtering
dc.subject	foundation models
dc.subject	language models
dc.subject	multimodal models
dc.subject	pretraining
dc.subject	Computer science
dc.subject	Artificial intelligence
dc.subject.other	Computer science and engineering
dc.title	Data as Foundation: Designing Systematic Curation for an Evolving Foundation Model Landscape
dc.type	Thesis

Files

Original bundle

Name: Nguyen_washington_0250E_29348.pdf
Size: 13.56 MB
Format: Adobe Portable Document Format

Collections

Computer science and engineering