Toward Robust, Reliable, and Generalizable Models for Tabular Data

dc.contributor.advisor: Schmidt, Ludwig
dc.contributor.author: Gardner, Joshua
dc.date.accessioned: 2024-10-16T03:11:55Z
dc.date.available: 2024-10-16T03:11:55Z
dc.date.issued: 2024-10-16
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Tabular data -- spreadsheet-type data with rows and columns -- is widely used across many domains and real-world applications, from finance to healthcare to natural and social sciences. However, tabular data modeling has received far less attention, and seen far less rapid progress, than modern AI for other forms of data (such as images, natural language, and audio). In this dissertation, I conduct a series of studies to empirically assess the conditions under which modern tabular methods succeed and fail in prediction tasks, and leverage the results of these studies to develop a new approach to tabular modeling. Specifically, I first study the properties of tabular models under (1) subpopulation shift, and (2) distribution shift. The results show (among other findings) that simple, standard models with low in-distribution test error achieve best-in-class robustness under both forms of shift. Then, I leverage the findings of these studies to build a first-of-its-kind, state-of-the-art foundation model for tabular data prediction, TabuLa. TabuLa exhibits the ability to perform zero-shot generalization to new tables, a capability not possible with existing state-of-the-art tabular models such as XGBoost and TabPFN. TabuLa also outperforms these state-of-the-art models in few-shot generalization, with sample efficiency up to 16x better than existing methods. Collectively, these works point toward new and exciting future directions for tabular data and open up opportunities for high-quality, general-purpose tabular AI models.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Gardner_washington_0250E_27521.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52462
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: distribution shift
dc.subject: machine learning
dc.subject: subgroup robustness
dc.subject: tabular
dc.subject: XGBoost
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Toward Robust, Reliable, and Generalizable Models for Tabular Data
dc.type: Thesis

Files

Original bundle

Name: Gardner_washington_0250E_27521.pdf
Size: 8.92 MB
Format: Adobe Portable Document Format