Building Robust Text Classification Models under Provenance Shift: Methods of Adjustment and a Framework for Evaluation

dc.contributor.advisor: Cohen, Trevor
dc.contributor.author: Ding, Xiruo
dc.date.accessioned: 2025-01-23T20:03:21Z
dc.date.issued: 2025-01-23
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Machine learning and deep learning have consistently delivered groundbreaking contributions across a wide range of disciplines. Biomedical research also benefits from such methods at every scale, from the molecular level (such as in structural biology) to the population level. Many learning algorithms require an adequate amount of data to fully train a model, and also assume that training and test data are drawn from the same distribution. This may be achievable for problems in the general domain: for example, large datasets exist for computer vision (CIFAR-10, CIFAR-100, etc.) and natural language processing (Amazon Reviews, Yelp Reviews, Wikipedia, etc.). In biomedical research, however, it is challenging to collect data on the order of millions of examples when high-quality patient-related data are needed. One feasible solution is to combine data from several sites. This approach can also increase the variety of the data, which helps in building robust models. However, models trained in such settings may learn spurious correlations between data provenance and the target of interest. Naturally, this can also happen when subpopulations with different characteristics exist within a single dataset. The effect can be detrimental when a model is deployed in a new setting where the provenance composition shifts. This thesis addresses scenarios in which confounding by provenance and provenance shift are the main concerns. Formal definitions and a simulation framework are introduced first. Building on these, the aim is to find practical ways to train models that are robust to provenance shift while maintaining reasonable performance. This goal is pursued through several means, from statistical adjustment through distribution adjustment to architecture adjustment.
Two key contributions are: (1) a framework for experimentally simulating different degrees of provenance shift and evaluating model robustness and performance; and (2) several effective adjustment methods for building more robust models. The framework and adjustment methods were tested on three datasets, two from the biomedical domain and one from the general domain, to validate their generalizability. Results indicate that the methods, each targeting a different aspect of the modeling procedure, improve model robustness, and that model performance can also improve when provenance shift is extreme. This work contributes to our understanding of how provenance shift affects model performance, and provides methods for developing models that withstand the challenges posed by such shifts, ultimately leading to algorithms that are more reliable, trustworthy, and less biased.
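The abstract describes confounding by provenance: a model trained on multi-site data can latch onto the association between site and label, and then fail when that association shifts at deployment. The thesis's actual simulation framework is not detailed in this record; the following is a hypothetical minimal sketch (all names and parameter values are illustrative) of how one might generate training and test splits with different provenance–label associations and observe the failure of a shortcut classifier that relies on provenance alone.

```python
import random

def make_split(n, p_pos_given_site_a, p_pos_given_site_b,
               site_a_frac=0.5, seed=0):
    """Generate (site, label) pairs where label prevalence differs by site.

    The gap between the two conditional probabilities controls how strongly
    provenance is confounded with the label in this split.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        site = "A" if rng.random() < site_a_frac else "B"
        p = p_pos_given_site_a if site == "A" else p_pos_given_site_b
        label = 1 if rng.random() < p else 0
        data.append((site, label))
    return data

# Training split: provenance strongly confounded with the label.
train = make_split(10_000, p_pos_given_site_a=0.9, p_pos_given_site_b=0.1)

# Test split: the association is reversed (an extreme provenance shift).
test = make_split(10_000, p_pos_given_site_a=0.1, p_pos_given_site_b=0.9,
                  seed=1)

# A "classifier" that uses provenance as a shortcut feature.
def predict(site):
    return 1 if site == "A" else 0

def accuracy(data):
    return sum(predict(s) == y for s, y in data) / len(data)
```

Under these settings the shortcut classifier scores roughly 0.9 on the training split but roughly 0.1 on the shifted test split, which is the kind of degradation the evaluation framework is designed to expose, and the adjustment methods to mitigate.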
dc.embargo.lift: 2026-01-23T20:03:21Z
dc.embargo.terms: Delay release for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Ding_washington_0250E_27616.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52690
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: confounding shift
dc.subject: language models
dc.subject: multi-institutional datasets
dc.subject: robustness
dc.subject: text classification
dc.subject: Bioinformatics
dc.subject: Computer science
dc.subject.other: Biomedical and health informatics
dc.title: Building Robust Text Classification Models under Provenance Shift: Methods of Adjustment and a Framework for Evaluation
dc.type: Thesis

Files

Original bundle

Name: Ding_washington_0250E_27616.pdf
Size: 56.73 MB
Format: Adobe Portable Document Format