Deep clustering to identify subgroups of multivariate trajectories in longitudinal biomedical datasets

Vemuri, Bhargav

dc.contributor.advisor	Tarczy-Hornoch, Peter
dc.contributor.author	Vemuri, Bhargav
dc.date.accessioned	2026-02-05T19:29:32Z
dc.date.available	2026-02-05T19:29:32Z
dc.date.issued	2026-02-05
dc.date.submitted	2025
dc.description	Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract	Unsupervised patient subgrouping in longitudinal biomedical datasets enables the discovery of distinct temporal phenotypes that capture heterogeneity in disease progression, treatment response dynamics, developmental trajectories, telemonitoring, and more. One-stage multivariate time series (MVTS) deep clustering methods are well-suited to this task because they (1) jointly model multiple longitudinal variables and (2) integrate missing data imputation, representation learning, and clustering into a unified framework. Recent state-of-the-art MVTS deep clustering approaches include Variational Deep Embedding with Recurrence (VaDER; de Jong et al., 2019) and Clustering Representation Learning on Incomplete time-series data (CRLI; Ma et al., 2021). In this work, we apply CRLI in two real-world longitudinal biomedical contexts and evaluate its performance against VaDER using 20 synthetic MVTS datasets of our own design. Our overarching question was: how and when are one-stage MVTS clustering methods (VaDER, CRLI) useful in biomedical research data exploration? In Aim 1 (Assessing the ability of CRLI to detect meaningful trajectories in a sparse, irregular, biased real-world dataset), we explored CRLI’s capacity to detect multivariate trajectories in the electronic health record (EHR). Temporal EHR data is marred by irregular measurement intervals, high missingness, and multiple biases (selection, measurement, time-related). We assessed how well CRLI handles these hurdles in the context of identifying GLP-1 medication (semaglutide, dulaglutide, etc.) treatment response subgroups in the NIH All of Us Research Study. We showed that (1) CRLI can be used to identify post-treatment multivariate response trajectories in the EHR and (2) this is possible despite a small cohort (n=336) and infrequent measurements. 
In Aim 2 (Assessing the ability of CRLI to detect meaningful trajectories in a high-dimensional, multimodal, prospective dataset), we applied CRLI to another real-world data source, the Adolescent Brain Cognitive Development (ABCD) Study, a longitudinal observational cohort with a prespecified assessment protocol, including a consistent follow-up schedule and a high retention rate (98.9%). This dataset allowed us to explore physical health trajectories (pubertal hormones, anthropometrics) as we did in Aim 1, but also mental health trajectories, as measured by 8 Child Behavior Checklist (CBCL) syndrome scales. We calculated cluster associations with mental health outcomes to better characterize cluster differences. We showed that (1) given longitudinal and static variables, CRLI identified longitudinal trajectories that had non-uniform associations with static variables, providing a basis for testable clinical hypotheses, and (2) CRLI identified clusters that could not have been identified with a single timepoint or single variable alone. In Aim 3 (Assessing the ability of CRLI and VaDER to detect trajectories in synthetic datasets under diverse data constraints), we designed a framework using the mockseries Python package that let us rapidly generate unique MVTS datasets by sampling from a range of values for various dataset characteristics (time series length, noise, missingness, number of clusters, number of samples). We also incorporated the ability to modify time series variable properties (trend, rate of change, seasonality) by designing 5 distinct variable styles inspired by biomedical trends we observed in Aims 1 and 2 and in the literature. We reported VaDER and CRLI performance on 4 external clustering validation indices (purity, RI, ARI, NMI) across 20 synthetic datasets.
We showed that (1) practitioners should be wary of novel methods that do not report performance on adjusted metrics (ARI, AMI), (2) 2D visualizations are an invaluable interpretability tool, especially when there are too many longitudinal variables to understand on an individual basis, and (3) while CRLI generally outperforms VaDER, neither method achieved across-the-board ARI dominance. Cross-cutting contributions that emerged across the aims were as follows: (1) we observed that internal clustering validation indices (Calinski-Harabasz, Silhouette, Davies-Bouldin, S_Dbw validity) were rarely concordant, which complicated the selection of the optimal number of clusters in Aims 1 and 2, (2) cohort selection criteria that required a minimum number of repeat measurements across multiple longitudinal variables resulted in final cohorts that may not have generalized well to the population and/or an external validation dataset, (3) method performance in Aim 3 as measured by Adjusted Rand Index (external clustering validation index) was subpar compared to other indices that have been reported in the literature, casting doubt on the trustworthiness of clusters identified in the previous aims, and (4) visual (qualitative) inspection and interpretation of identified clusters is a necessary complement to quantitative clustering result evaluation (by internal and/or external clustering validation index) for a holistic understanding of trajectory differences between clusters.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Vemuri_washington_0250E_29143.pdf
dc.identifier.uri	https://hdl.handle.net/1773/55115
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Deep clustering
dc.subject	Electronic health records
dc.subject	Real-world data
dc.subject	Representation Learning
dc.subject	Time series
dc.subject	Trajectory subgrouping
dc.subject	Bioinformatics
dc.subject.other	To Be Assigned
dc.title	Deep clustering to identify subgroups of multivariate trajectories in longitudinal biomedical datasets
dc.type	Thesis
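
Illustrative note (not part of the thesis record): the abstract cautions against methods that do not report chance-adjusted metrics such as ARI. The sketch below is a minimal, standard-library illustration of why that caution matters. It is not the thesis's evaluation code; the function names and the toy labels are assumptions made for this example, and it assumes at least two samples. It compares the Rand Index (RI) with the Adjusted Rand Index (ARI) on a predicted labeling that is essentially unrelated to the true one.

```python
from collections import Counter
from math import comb


def pair_counts(truth, pred):
    """Count sample pairs grouped together in both labelings, in truth, in pred, and overall."""
    cells = Counter(zip(truth, pred))  # contingency table as (truth, pred) -> count
    same_both = sum(comb(n, 2) for n in cells.values())
    same_truth = sum(comb(n, 2) for n in Counter(truth).values())
    same_pred = sum(comb(n, 2) for n in Counter(pred).values())
    return same_both, same_truth, same_pred, comb(len(truth), 2)


def rand_index(truth, pred):
    """Fraction of pairs on which the two labelings agree (together/together or apart/apart)."""
    same_both, same_truth, same_pred, total = pair_counts(truth, pred)
    return (total + 2 * same_both - same_truth - same_pred) / total


def adjusted_rand_index(truth, pred):
    """Rand Index corrected for the agreement expected by chance."""
    same_both, same_truth, same_pred, total = pair_counts(truth, pred)
    expected = same_truth * same_pred / total
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both labelings
        return 1.0
    return (same_both - expected) / (max_index - expected)


# Hypothetical toy data: 4 true clusters of 25 samples each,
# and a prediction that simply cycles through 4 labels.
truth = [i // 25 for i in range(100)]
pred = [i % 4 for i in range(100)]

print(f"RI  = {rand_index(truth, pred):.3f}")
print(f"ARI = {adjusted_rand_index(truth, pred):.3f}")
```

On this toy input the RI is above 0.6 even though the prediction carries almost no information about the true clusters, while the ARI is close to zero. This is the kind of gap that can make unadjusted indices look more favorable than the clustering deserves.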

Files

Original bundle

Name: Vemuri_washington_0250E_29143.pdf
Size: 15.27 MB
Format: Adobe Portable Document Format

Collections

To Be Assigned