Interpretable Machine Learning for Biomarker Identification in RNA Seq Cancer Data
| dc.contributor.advisor | Kim, Wooyoung | |
| dc.contributor.author | Newton, Jeremy | |
| dc.date.accessioned | 2026-02-05T19:29:51Z | |
| dc.date.available | 2026-02-05T19:29:51Z | |
| dc.date.issued | 2026-02-05 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Master's)--University of Washington, 2025 | |
| dc.description.abstract | Existing research on RNA Seq gene expression biomarkers has produced various methods for selecting a small list of genes as cancer biomarkers from large gene expression datasets. Earlier methods for identifying potential gene expression cancer biomarkers focused on statistical analysis, while others have incorporated machine learning, often including Interpretable Machine Learning (iML) techniques. On 16 cancer types from TCGA data, we used three inherently interpretable machine learning models (Logistic Regression, Random Forest, and Linear Support Vector Machine) to narrow down subsets of potential biomarker genes using the trained models' feature importance rankings. We then applied model-agnostic iML techniques, such as Shapley Additive Explanations (SHAP) and Permutation Importance, to narrow the subsets further. We compared the classification performance of machine learning models trained on iML-selected features with that of models trained on features selected by statistical methods and on biomarkers from external research. We found that iML biomarker selection methods lead to comparable or better classification performance on these datasets than biomarkers from outside research or from statistical analysis alone. Mutual Information estimation (MI) was a surprisingly useful technique for initial feature selection, and iML techniques improved the MI-selected features for classification. We cross-checked potential biomarkers against biomedical annotations and gene pathway analysis, finding some support for the validity of the biomarkers. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Newton_washington_0250O_29168.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/55117 | |
| dc.language.iso | en_US | |
| dc.relation.haspart | JGN_Thesis_2025_GDrive_SupplementalMaterials_1.zip; other; GoogleDrive csvs and extra figures. | |
| dc.relation.haspart | JGN_Thesis_2025_GDrive_SupplementalMaterials_2.zip; other; GoogleDrive csvs and extra figures. | |
| dc.relation.haspart | JGNThesis2025_SM(1).zip; other; HardDrive extra data csvs, and figures. | |
| dc.rights | none | |
| dc.subject | biomarkers | |
| dc.subject | cancer | |
| dc.subject | interpretable machine learning | |
| dc.subject | machine learning | |
| dc.subject | RNA seq | |
| dc.subject | TCGA | |
| dc.subject | Computer science | |
| dc.subject | Bioinformatics | |
| dc.subject.other | To Be Assigned | |
| dc.title | Interpretable Machine Learning for Biomarker Identification in RNA Seq Cancer Data | |
| dc.type | Thesis |
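
The staged selection described in the abstract (Mutual Information filtering, then model feature importance rankings, then a model-agnostic technique) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the thesis code: it uses synthetic data from scikit-learn's `make_classification` in place of TCGA expression data, only Logistic Regression out of the three models named, and Permutation Importance as the sole model-agnostic step (SHAP is omitted). The function name `select_candidate_features` and the subset sizes are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def select_candidate_features(X, y, n_mi=50, n_model=20, n_final=10, seed=0):
    """Narrow a feature set in three stages; returns column indices per stage.

    Subset sizes are illustrative assumptions, not values from the thesis.
    """
    # Hold out data so permutation importance is not measured on training samples.
    X_train, X_held, y_train, y_held = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    scaler = StandardScaler().fit(X_train)
    X_train, X_held = scaler.transform(X_train), scaler.transform(X_held)

    # Stage 1: keep the features with the highest estimated mutual information.
    mi = mutual_info_classif(X_train, y_train, random_state=seed)
    mi_idx = np.argsort(mi)[::-1][:n_mi]

    # Stage 2: rank the survivors by absolute logistic regression coefficient.
    clf = LogisticRegression(max_iter=2000).fit(X_train[:, mi_idx], y_train)
    coef_strength = np.abs(clf.coef_).sum(axis=0)
    model_idx = mi_idx[np.argsort(coef_strength)[::-1][:n_model]]

    # Stage 3: refit on the smaller set and rank by permutation importance.
    clf = LogisticRegression(max_iter=2000).fit(X_train[:, model_idx], y_train)
    perm = permutation_importance(
        clf, X_held[:, model_idx], y_held, n_repeats=10, random_state=seed
    )
    final_idx = model_idx[np.argsort(perm.importances_mean)[::-1][:n_final]]
    return mi_idx, model_idx, final_idx


# Synthetic stand-in for an expression matrix (samples x genes).
X, y = make_classification(
    n_samples=200, n_features=100, n_informative=10, random_state=0
)
mi_idx, model_idx, final_idx = select_candidate_features(X, y)
print("final candidate feature columns:", sorted(final_idx.tolist()))
```

Each stage only ever selects from the previous stage's survivors, so the final subset is nested inside the Mutual Information subset. With real data, the returned column indices would be mapped back to gene identifiers before any annotation or pathway cross-check.
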
Files
Original bundle
1 - 4 of 4
- Name: Newton_washington_0250O_29168.pdf
- Size: 6.62 MB
- Format: Adobe Portable Document Format
- Name: JGN_Thesis_2025_GDrive_SupplementalMaterials_1.zip
- Size: 100.39 MB
- Format: Unknown data format
- Name: JGN_Thesis_2025_GDrive_SupplementalMaterials_2.zip
- Size: 222.32 MB
- Format: Unknown data format
