Deep clustering to identify subgroups of multivariate trajectories in longitudinal biomedical datasets

dc.contributor.advisor: Tarczy-Hornoch, Peter
dc.contributor.author: Vemuri, Bhargav
dc.date.accessioned: 2026-02-05T19:29:32Z
dc.date.available: 2026-02-05T19:29:32Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Unsupervised patient subgrouping in longitudinal biomedical datasets enables the discovery of distinct temporal phenotypes that capture heterogeneity in disease progression, treatment response dynamics, developmental trajectories, telemonitoring, and more. One-stage multivariate time series (MVTS) deep clustering methods are well-suited to this task because they (1) jointly model multiple longitudinal variables and (2) integrate missing data imputation, representation learning, and clustering into a unified framework. Recent state-of-the-art MVTS deep clustering approaches include Variational Deep Embedding with Recurrence (VaDER; de Jong et al., 2019) and Clustering Representation Learning on Incomplete time-series data (CRLI; Ma et al., 2021). In this work, we apply CRLI in two real-world longitudinal biomedical contexts and evaluate its performance against VaDER using 20 synthetic MVTS datasets of our own design. Our overarching question was: how and when are one-stage MVTS clustering methods (VaDER, CRLI) useful in biomedical research data exploration?
In Aim 1 (Assessing the ability of CRLI to detect meaningful trajectories in a sparse, irregular, biased real-world dataset), we explored CRLI’s capacity to detect multivariate trajectories in the electronic health record (EHR). Temporal EHR data is marred by irregular measurement intervals, high missingness, and multiple biases (selection, measurement, time-related). We assessed how well CRLI handles these hurdles in the context of identifying GLP-1 medication (semaglutide, dulaglutide, etc.) treatment response subgroups in the NIH All of Us Research Study. We showed that (1) CRLI can be used to identify post-treatment multivariate response trajectories in the EHR and (2) this is possible despite a small cohort (n=336) and infrequent measurements.
In Aim 2 (Assessing the ability of CRLI to detect meaningful trajectories in a high-dimensional, multimodal, prospective dataset), we applied CRLI to another real-world data source, the Adolescent Brain Cognitive Development (ABCD) Study, a longitudinal observational cohort with a prespecified assessment protocol, including a consistent follow-up schedule and a high retention rate (98.9%). This dataset allowed us to explore physical health trajectories (pubertal hormones, anthropometrics) as we did in Aim 1, but also mental health trajectories, as measured by 8 Child Behavior Checklist (CBCL) syndrome scales. We calculated cluster associations with mental health outcomes to better characterize cluster differences. We showed that (1) given longitudinal and static variables, CRLI identified longitudinal trajectories that had non-uniform associations with static variables, providing a basis for testable clinical hypotheses, and (2) CRLI identified clusters that could not have been identified with a single timepoint or single variable alone.
In Aim 3 (Assessing the ability of CRLI and VaDER to detect trajectories in synthetic datasets under diverse data constraints), we designed a framework using the mockseries Python package that let us rapidly generate unique MVTS datasets by sampling from a range of values for various dataset characteristics (time series length, noise, missingness, number of clusters, number of samples). We also incorporated the ability to modify time series variable properties (trend, rate of change, seasonality) by designing 5 distinct variable styles inspired by biomedical trends we observed in Aims 1 and 2 and the literature. We reported VaDER and CRLI performance on 4 external clustering validation indices (purity, RI, ARI, NMI) across 20 synthetic datasets. We showed that (1) practitioners should be wary of novel methods that do not report performance on adjusted metrics (ARI, AMI), (2) 2D visualizations are an invaluable interpretability tool, especially when there are too many longitudinal variables to understand on an individual basis, and (3) while CRLI generally outperforms VaDER, neither method achieved across-the-board ARI dominance.
Cross-cutting contributions that emerged across the aims were as follows: (1) we observed that internal clustering validation indices (Calinski-Harabasz, Silhouette, Davies-Bouldin, S_Dbw validity) were rarely concordant, which complicated selection of the optimal number of clusters in Aims 1 and 2, (2) cohort selection criteria that required a minimum number of repeat measurements across multiple longitudinal variables resulted in final cohorts that may not have generalized well to the population and/or an external validation dataset, (3) method performance in Aim 3 as measured by the Adjusted Rand Index (an external clustering validation index) was subpar compared to other indices that have been reported in the literature, casting doubt on the trustworthiness of the clusters identified in the previous aims, and (4) visual (qualitative) inspection and interpretation of identified clusters is a necessary complement to quantitative clustering result evaluation (by internal and/or external clustering validation index) for a holistic understanding of trajectory differences between clusters.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Vemuri_washington_0250E_29143.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55115
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Deep clustering
dc.subject: Electronic health records
dc.subject: Real-world data
dc.subject: Representation Learning
dc.subject: Time series
dc.subject: Trajectory subgrouping
dc.subject: Bioinformatics
dc.subject.other: To Be Assigned
dc.title: Deep clustering to identify subgroups of multivariate trajectories in longitudinal biomedical datasets
dc.type: Thesis

Files

Original bundle

Name: Vemuri_washington_0250E_29143.pdf
Size: 15.27 MB
Format: Adobe Portable Document Format
