Toward Robust, Reliable, and Generalizable Models for Tabular Data
Abstract
Tabular data (spreadsheet-type data with rows and columns) is widely used across many domains and real-world applications, from finance to healthcare to the natural and social sciences. However, tabular data modeling has received far less attention, and has progressed far more slowly, than modern AI for other forms of data (such as images, natural language, and audio). In this dissertation, I conduct a series of studies to empirically assess the conditions under which modern tabular methods succeed and fail in prediction tasks, and I use the results of these studies to develop a new approach to tabular modeling. Specifically, I first study the properties of tabular models under (1) subpopulation shift and (2) distribution shift. The results show (among other findings) that simple, standard models with low in-distribution test error achieve best-in-class robustness under both forms of shift. I then build on these findings to develop a first-of-its-kind, state-of-the-art foundation model for tabular data prediction, TabuLa. TabuLa can perform zero-shot generalization to new tables, a capability not possible with existing state-of-the-art tabular models such as XGBoost and TabPFN. TabuLa also outperforms these state-of-the-art models in few-shot generalization, with sample efficiency up to 16x better than existing methods. Collectively, these works point toward new and exciting future directions for tabular data and open up opportunities for high-quality, general-purpose tabular AI models.
Description
Thesis (Ph.D.)--University of Washington, 2024
