Interpretable Machine Learning for Biomarker Identification in RNA Seq Cancer Data

Abstract

Existing research on RNA Seq gene expression biomarkers offers a variety of methods for selecting a short list of genes as cancer biomarkers from large gene expression datasets. Many earlier methods for identifying candidate gene expression biomarkers rely on statistical analysis, while others incorporate machine learning, often including Interpretable Machine Learning (iML) techniques. On 16 cancer types from TCGA data, we used inherently interpretable machine learning models (Logistic Regression, Random Forest, and Linear Support Vector Machine) to narrow down subsets of candidate biomarker genes using the trained models' feature importance rankings. We then applied model-agnostic iML techniques, such as Shapley Additive Explanations (SHAP) and Permutation Importance, to narrow the subsets further. We compared the classification performance of machine learning models trained on iML-selected features against that of models trained on features selected by statistical methods and on biomarkers from external research. We found that iML biomarker selection methods lead to comparable or better classification performance on these datasets than biomarkers from outside research or from statistical analysis alone. Mutual Information (MI) estimation was a surprisingly useful technique for initial feature selection, and iML techniques improved on the MI-selected features for classification. We cross-checked the candidate biomarkers against biomedical annotations and gene pathway analysis, finding some support for their validity.
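The staged narrowing described in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of TCGA expression matrices, and uses only Logistic Regression and permutation importance (SHAP is left out to avoid an extra dependency). The function name `select_biomarkers` and the subset sizes `n_mi`, `n_model`, and `n_final` are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_biomarkers(X, y, n_mi=50, n_model=20, n_final=5, seed=0):
    """Hypothetical three-stage feature narrowing; returns column indices."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    # Stage 1: mutual information filter, estimated on the training split only.
    mi = mutual_info_classif(X_tr, y_tr, random_state=seed)
    mi_idx = np.argsort(mi)[::-1][:n_mi]

    # Stage 2: rank the MI-selected features by the magnitude of the
    # coefficients of a logistic regression fit on standardized inputs.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_tr[:, mi_idx], y_tr)
    coef = np.abs(model[-1].coef_).sum(axis=0)
    model_idx = mi_idx[np.argsort(coef)[::-1][:n_model]]

    # Stage 3: refit on the smaller subset, then rank by permutation
    # importance measured on the held-out split.
    model.fit(X_tr[:, model_idx], y_tr)
    perm = permutation_importance(
        model, X_te[:, model_idx], y_te, n_repeats=10, random_state=seed
    )
    final_idx = model_idx[np.argsort(perm.importances_mean)[::-1][:n_final]]
    return mi_idx, model_idx, final_idx


# Synthetic stand-in for an expression matrix (samples x genes); assumption only.
X, y = make_classification(
    n_samples=300, n_features=200, n_informative=10, n_redundant=0,
    shuffle=False, random_state=0,
)
mi_idx, model_idx, final_idx = select_biomarkers(X, y)
print(sorted(final_idx.tolist()))
```

Each stage only ever keeps a subset of the previous stage's features, so the final list is always drawn from the MI-selected pool. In a real study the ranking and the final performance estimate would need to come from separate data splits or nested cross-validation to avoid selection bias.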

Description

Thesis (Master's)--University of Washington, 2025
