Interpretable Data Phenotyping for Healthcare via Unsupervised Learning
MetadataShow full item record
Healthcare applications of machine learning tend toward greater requirements for model transparency than most applications. Yet the often high dimensionality of the data presents a significant impediment to meeting this requirement, particularly as it relates to the underlying relationships contributing to an individual prediction. Thus emerged the concept of "data phenotypes", clinically relevant groupings that facilitate population statistics and reduce barriers in the development of quality machine learning models. However, the results of current phenotyping methods are often difficult to interpret, and they often require clarification from an experienced clinician to be useful. This is a problem for administration-level prediction problems in particular, for example Length of Stay prediction, because those developing the models are not commonly clinicians, and because the results of these models are often desired with a fast turnaround. With the above in mind, this thesis reviews the utility of four prominent phenotyping approaches: k-means, agglomerative clustering, non-negative matrix factorization, and non-negative tensor factorization. We propose variants of the four approaches with the goal of producing distinct feature membership. We then show that our proposals can produce easily understandable phenotypes at no detriment to prediction performance over some real healthcare tasks.