Data-Centric Preprocessing for Multivariate Biological Data: From Conventional Pipelines to Novel Pair-Based Approach

dc.contributor.advisor: Kim, Wooyoung
dc.contributor.author: Selke, William D
dc.date.accessioned: 2025-08-01T22:09:44Z
dc.date.issued: 2025-08-01
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: High-throughput biological datasets, such as those from CAR-T cell therapy, are challenging to analyze due to high dimensionality, heterogeneity, and noise. Standard algorithms often fail not from lack of complexity, but from ignoring the biological structure of the data. This thesis introduces Paired Vector Centralization (PVC), a normalization method designed for paired experimental designs, like responder vs. toxicity comparisons. PVC re-centers feature vectors around biologically meaningful contrasts, correcting for baseline drift and concentration effects. Applied to protein–protein interaction (PPI) data from pre-infusion CAR-T assays, PVC improves classification and embedding quality over conventional methods. Additional experiments explored hybrid approaches, including Tab2Img and a Convolutional Neural Network–Random Forest (CNN-RF) ensemble, which offered insights but further underscored the value of biologically informed preprocessing. Overall, the findings support a shift from treating biological data as static snapshots to modeling them as dynamic transitions between paired states. Embedding domain knowledge directly into preprocessing enhances signal recovery and interpretability. This pairing-based perspective may offer broader value across immunology and other high-dimensional fields, especially where understanding relationships between conditions matters more than analyzing isolated measurements.
dc.embargo.lift: 2027-07-22T22:09:44Z
dc.embargo.terms: Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Selke_washington_0250O_28527.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53216
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: CAR-T
dc.subject: Paired Vector Centralization (PVC)
dc.subject: protein–protein interaction (PPI)
dc.subject: Bioinformatics
dc.subject: Computer science
dc.subject.other: Computing and software systems
dc.title: Data-Centric Preprocessing for Multivariate Biological Data: From Conventional Pipelines to Novel Pair-Based Approach
dc.type: Thesis
