Interpretable Machine Learning for Biomarker Identification in RNA Seq Cancer Data

dc.contributor.advisor: Kim, Wooyoung
dc.contributor.author: Newton, Jeremy
dc.date.accessioned: 2026-02-05T19:29:51Z
dc.date.available: 2026-02-05T19:29:51Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: Existing research on RNA Seq gene expression biomarkers has produced various methods for selecting a short list of genes as cancer biomarkers from large gene expression datasets. Earlier methods for identifying candidate gene expression cancer biomarkers focused on statistical analysis, while others have incorporated machine learning, often including Interpretable Machine Learning (iML) techniques. On 16 cancer types from TCGA data, we used inherently interpretable machine learning models (Logistic Regression, Random Forest, and Linear Support Vector Machine) to narrow down subsets of candidate biomarker genes using the trained models' feature importance rankings. We then applied model-agnostic iML techniques, such as Shapley Additive Explanations (SHAP) and Permutation Importance, to narrow the subsets further. We compared the classification performance of machine learning models trained on iML-selected features against models trained on features selected by statistical methods and on biomarkers from external research. We found that iML biomarker selection methods lead to classification performance on these datasets that is comparable to or better than that obtained with biomarkers from external research or from statistical analysis alone. Mutual Information (MI) estimation was a surprisingly useful technique for initial feature selection, and iML techniques improved the MI-selected features for classification. We cross-checked the potential biomarkers against biomedical annotations and gene pathway analysis, finding some support for their validity.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Newton_washington_0250O_29168.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55117
dc.language.iso: en_US
dc.relation.haspart: JGN_Thesis_2025_GDrive_SupplementalMaterials_1.zip; other; GoogleDrive csvs and extra figures.
dc.relation.haspart: JGN_Thesis_2025_GDrive_SupplementalMaterials_2.zip; other; GoogleDrive csvs and extra figures.
dc.relation.haspart: JGNThesis2025_SM(1).zip; other; HardDrive extra data csvs, and figures.
dc.rights: none
dc.subject: biomarkers
dc.subject: cancer
dc.subject: interpretable machine learning
dc.subject: machine learning
dc.subject: RNA seq
dc.subject: TCGA
dc.subject: Computer science
dc.subject: Bioinformatics
dc.subject.other: To Be Assigned
dc.title: Interpretable Machine Learning for Biomarker Identification in RNA Seq Cancer Data
dc.type: Thesis
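
Illustrative sketch (not taken from the thesis): the abstract describes a two-stage selection, with Mutual Information for an initial gene filter followed by a model-agnostic importance measure to narrow the subset. The toy example below shows that general idea using only the Python standard library. Everything in it is an assumption made for illustration: the synthetic six-gene data, the median binarization used for the MI estimate, the nearest-centroid classifier standing in for the models named in the abstract, and the cutoffs of 3 and 2 genes. Importance is also measured on the training data here for brevity, which a real analysis would avoid.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information estimate (in nats) for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def median_binarize(column):
    """Discretize a continuous column at its median (an illustrative choice)."""
    med = sorted(column)[len(column) // 2]
    return [int(v > med) for v in column]

def fit_centroids(X, y):
    """Nearest-centroid classifier: store the mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [r for r, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, row):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda k: sum((a - b) ** 2 for a, b in zip(row, centroids[k])))

def accuracy(centroids, X, y):
    return sum(predict(centroids, r) == t for r, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, rng, repeats=5):
    """Mean accuracy drop when one feature column is shuffled, per feature."""
    base = accuracy(centroids, X, y)
    scores = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(repeats):
            col = [r[j] for r in X]
            rng.shuffle(col)
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, col)]
            drops.append(base - accuracy(centroids, X_perm, y))
        scores.append(sum(drops) / repeats)
    return scores

# Synthetic data (hypothetical): 200 samples, 6 "genes"; only genes 0 and 1
# carry signal, via a mean shift of 2.0 in class 1.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(2.0 * t if j < 2 else 0.0, 1.0) for j in range(6)] for t in y]

# Stage 1: rank genes by estimated MI with the label and keep the top 3.
mi = [mutual_information(median_binarize([r[j] for r in X]), y) for j in range(6)]
top_mi = sorted(range(6), key=lambda j: mi[j], reverse=True)[:3]

# Stage 2: fit a model on the MI-selected genes, then keep the 2 genes whose
# permutation causes the largest accuracy drop.
X_sel = [[r[j] for j in top_mi] for r in X]
centroids = fit_centroids(X_sel, y)
pi = permutation_importance(centroids, X_sel, y, rng)
order = sorted(range(len(top_mi)), key=lambda k: pi[k], reverse=True)
final = [top_mi[k] for k in order[:2]]

print("MI-selected genes:", top_mi)
print("Genes kept after permutation importance:", final)
```

In this sketch the MI filter is cheap and model-free, while the permutation step depends on a fitted model, which mirrors the ordering described in the abstract without reproducing the thesis's actual models, data, or thresholds.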

Files

Original bundle

Name: Newton_washington_0250O_29168.pdf
Size: 6.62 MB
Format: Adobe Portable Document Format

Name: JGN_Thesis_2025_GDrive_SupplementalMaterials_1.zip
Size: 100.39 MB
Format: Unknown data format

Name: JGN_Thesis_2025_GDrive_SupplementalMaterials_2.zip
Size: 222.32 MB
Format: Unknown data format

Name: JGNThesis2025_SM(1).zip
Size: 308.71 MB
Format: Unknown data format
