Data-Centric Preprocessing for Multivariate Biological Data: From Conventional Pipelines to Novel Pair-Based Approach

Abstract

High-throughput biological datasets, such as those from CAR-T cell therapy, are challenging to analyze because of their high dimensionality, heterogeneity, and noise. Standard algorithms often fail not because they lack complexity, but because they ignore the biological structure of the data. This thesis introduces Paired Vector Centralization (PVC), a normalization method designed for paired experimental designs, such as responder versus toxicity comparisons. PVC re-centers feature vectors around biologically meaningful contrasts, correcting for baseline drift and concentration effects. Applied to protein–protein interaction (PPI) data from pre-infusion CAR-T assays, PVC improves classification and embedding quality over conventional methods. Additional experiments explored hybrid approaches, including Tab2Img and a Convolutional Neural Network–Random Forest (CNN-RF) ensemble; these offered insights but further underscored the value of biologically informed preprocessing. Overall, the findings support a shift from treating biological data as static snapshots to modeling them as dynamic transitions between paired states. Embedding domain knowledge directly into preprocessing enhances signal recovery and interpretability. This pairing-based perspective may offer broader value across immunology and other high-dimensional fields, especially where understanding relationships between conditions matters more than analyzing isolated measurements.
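
The abstract does not give the PVC formula, so the sketch below is not the thesis's method. It only illustrates the general idea of pair-based centering under one assumed reading: each pair of feature vectors is re-centered around its own midpoint, so that a baseline shift shared by both members of a pair cancels out. The function name `center_pair` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def center_pair(a: Sequence[float], b: Sequence[float]) -> Tuple[Vector, Vector]:
    """Re-center two paired feature vectors around their shared midpoint.

    Assumption for illustration only: the "contrast" is taken to be the
    per-feature midpoint of the pair. The thesis may define it differently.
    """
    if len(a) != len(b):
        raise ValueError("paired vectors must have the same number of features")
    midpoint = [(x + y) / 2.0 for x, y in zip(a, b)]
    a_centered = [x - m for x, m in zip(a, midpoint)]
    b_centered = [y - m for y, m in zip(b, midpoint)]
    return a_centered, b_centered


# Hypothetical toy data: two pairs with the same within-pair contrast,
# but the second pair sits on a much higher per-feature baseline.
pair_1 = ([10.0, 4.0, 7.0], [12.0, 4.0, 5.0])
pair_2 = ([110.0, 54.0, 87.0], [112.0, 54.0, 85.0])

centered_1 = center_pair(*pair_1)
centered_2 = center_pair(*pair_2)
```

In this toy case the two pairs become identical after centering, because only the within-pair difference survives. Whether PVC uses the midpoint, one member of the pair, or another reference as its center is not stated in the abstract.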

Description

Thesis (Master's)--University of Washington, 2025
