Toward Robust, Reliable, and Generalizable Models for Tabular Data

Abstract

Tabular data (spreadsheet-type data with rows and columns) is widely used across many domains and real-world applications, from finance to healthcare to the natural and social sciences. However, tabular data modeling has received far less attention, and has progressed far more slowly, than modern AI for other forms of data such as images, natural language, and audio. In this dissertation, I conduct a series of studies to empirically assess the conditions under which modern tabular methods succeed and fail in prediction tasks, and I use the results to develop a new approach to tabular modeling. Specifically, I first study the properties of tabular models under (1) subpopulation shift and (2) distribution shift. The results show, among other findings, that simple, standard models with low in-distribution test error achieve best-in-class robustness under both forms of shift. I then build on these findings to develop TabuLa, a first-of-its-kind, state-of-the-art foundation model for tabular data prediction. TabuLa can perform zero-shot generalization to new tables, a capability not possible with existing state-of-the-art tabular models such as XGBoost and TabPFN. TabuLa also outperforms these models in few-shot generalization, with sample efficiency up to 16x better than existing methods. Collectively, these works point toward exciting new directions for tabular data and open up opportunities for high-quality, general-purpose tabular AI models.

Description

Thesis (Ph.D.)--University of Washington, 2024
