Building Robust Text Classification Models under Provenance Shift: Methods of Adjustment and a Framework for Evaluation

dc.contributor.advisor: Cohen, Trevor
dc.contributor.author: Ding, Xiruo
dc.date.accessioned: 2025-01-23T20:03:21Z
dc.date.issued: 2025-01-23
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Machine learning and deep learning have consistently delivered groundbreaking contributions across a wide range of disciplines. Biomedical research also benefits from such methods at every scale, from the molecular level (such as in structural biology) to the population level. Many learning algorithms require an adequate amount of data to fully train a model, and also assume that training and test data are drawn from the same distribution. This may be achievable for problems in the general domain: for example, large datasets exist for computer vision (CIFAR-10, CIFAR-100, etc.) and natural language processing (Amazon Reviews, Yelp Reviews, Wikipedia, etc.). In biomedical research, however, it is challenging to collect data on the order of millions of examples when high-quality patient-related data are needed. One feasible solution is to combine data from several sites. This approach can also increase the variety of the data, which helps in building robust models. However, models trained in such settings may learn spurious correlations between data provenance and the target of interest. Naturally, this can also happen when subpopulations with different characteristics exist within a single dataset. The effect can be detrimental when a model is deployed in a new setting where the provenance composition shifts. This thesis addresses scenarios in which confounding by provenance and provenance shift are the main concerns. Formal definitions and a simulation framework are introduced first. Building on these, the aim is to find practical ways to train models that are robust to provenance shift while maintaining reasonable performance. This goal is pursued through several means, from statistical adjustment through distribution adjustment to architecture adjustment.
Two key contributions are: (1) a framework for experimentally simulating different degrees of provenance shift and evaluating model robustness and performance; and (2) several effective adjustment methods for building more robust models. The framework and adjustment methods were tested on three datasets, two from the biomedical domain and one from the general domain, to validate their generalizability. Results indicate that the methods, each targeting a different aspect of the modeling procedure, improve model robustness, and that model performance can also improve when provenance shift is extreme. This work contributes to our understanding of how provenance shift affects model performance, and provides methods for developing models that withstand the challenges posed by such shifts, ultimately leading to algorithms that are more reliable, trustworthy, and less biased.
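The abstract describes confounding by provenance: a model trained on multi-site data can latch onto the association between site and label, and then fail when that association shifts at deployment. The thesis's actual simulation framework is not detailed in this record; the following is a hypothetical minimal sketch (all names and parameter values are illustrative) of how one might generate training and test splits with different provenance–label associations and observe the failure of a shortcut classifier that relies on provenance alone.

```python
import random

def make_split(n, p_pos_given_site_a, p_pos_given_site_b,
               site_a_frac=0.5, seed=0):
    """Generate (site, label) pairs where label prevalence differs by site.

    The gap between the two conditional probabilities controls how strongly
    provenance is confounded with the label in this split.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        site = "A" if rng.random() < site_a_frac else "B"
        p = p_pos_given_site_a if site == "A" else p_pos_given_site_b
        label = 1 if rng.random() < p else 0
        data.append((site, label))
    return data

# Training split: provenance strongly confounded with the label.
train = make_split(10_000, p_pos_given_site_a=0.9, p_pos_given_site_b=0.1)

# Test split: the association is reversed (an extreme provenance shift).
test = make_split(10_000, p_pos_given_site_a=0.1, p_pos_given_site_b=0.9,
                  seed=1)

# A "classifier" that uses provenance as a shortcut feature.
def predict(site):
    return 1 if site == "A" else 0

def accuracy(data):
    return sum(predict(s) == y for s, y in data) / len(data)
```

Under these settings the shortcut classifier scores roughly 0.9 on the training split but roughly 0.1 on the shifted test split, which is the kind of degradation the evaluation framework is designed to expose, and the adjustment methods to mitigate.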
dc.embargo.lift: 2026-01-23T20:03:21Z
dc.embargo.terms: Delay release for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Ding_washington_0250E_27616.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52690
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: confounding shift
dc.subject: language models
dc.subject: multi-institutional datasets
dc.subject: robustness
dc.subject: text classification
dc.subject: Bioinformatics
dc.subject: Computer science
dc.subject.other: Biomedical and health informatics
dc.title: Building Robust Text Classification Models under Provenance Shift: Methods of Adjustment and a Framework for Evaluation
dc.type: Thesis

Files

Original bundle

Name: Ding_washington_0250E_27616.pdf
Size: 56.73 MB
Format: Adobe Portable Document Format