Reframing Cox Proportional Hazards Model for Big Data and Neural Networks

Tarkhan, Aliasghar (Arash)

Files

Tarkhan_washington_0250E_25148.pdf (2.2 MB)

Date

2023-04-17

Abstract

In many medical and biomedical applications, the outcome is measured as a “time-to-event” (e.g., time to disease progression or death). The aim of this dissertation is to propose frameworks for survival analysis and prediction with survival data that include many observations, ultra-high dimensional features, or images. The proposed frameworks are computationally efficient and stable, are amenable to stochastic optimization algorithms, and scale to extremely large datasets that do not fit into memory.
The aim of survival analysis is to assess the connection between the characteristics of a patient and the time-to-event outcome. To do this, it is common to assume a proportional hazards model and fit a proportional hazards regression (Cox regression). The Cox proportional hazards model is fit by maximizing a log-concave objective function known as the “partial likelihood”. For moderate-sized datasets, an efficient Newton-Raphson algorithm that leverages the structure of the objective function can be employed. For datasets with a very large number of observations, this approach has two issues. First, the computational tricks that leverage the structure of the objective function can lead to computational instability. Second, the objective function does not naturally decouple over observations, so the model can be computationally expensive to fit if the dataset does not fit into memory. In Chapter 2, we propose a novel modified framework for proportional hazards regression to address these issues. The proposed framework results in an objective function amenable to stochastic gradient descent. We show that this simple modification allows us to efficiently fit survival models on extremely large datasets with very many observations. The framework also facilitates training complex models (e.g., neural-network-based models) with survival data, and we propose a straightforward neural network architecture for survival prediction.
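To make the idea of an objective "amenable to stochastic gradient descent" concrete, the following is a minimal sketch in which the standard partial likelihood is evaluated only within each random mini-batch, so that every update touches a small slice of the data. This is one common way to obtain such an objective and is an assumption here; the abstract does not spell out the dissertation's exact formulation. The function names, batch size, and learning rate are illustrative, and ties in event times are not handled.

```python
import numpy as np

def _risk_set_weights(eta, time):
    """Softmax weights of exp(eta_j) over each subject's risk set."""
    # at_risk[i, j] is True when subject j is still at risk at subject i's time
    at_risk = time[None, :] >= time[:, None]
    masked = np.where(at_risk, eta[None, :], -np.inf)
    row_max = masked.max(axis=1, keepdims=True)   # per-row max for stability
    w = np.exp(masked - row_max)
    denom = w.sum(axis=1, keepdims=True)
    log_denom = (row_max + np.log(denom)).ravel()
    return w / denom, log_denom

def batch_loss(beta, X, time, event):
    """Negative log partial likelihood restricted to one mini-batch."""
    eta = X @ beta
    _, log_denom = _risk_set_weights(eta, time)
    n_events = max(event.sum(), 1.0)
    return -np.sum(event * (eta - log_denom)) / n_events

def batch_gradient(beta, X, time, event):
    """Gradient of batch_loss with respect to beta."""
    eta = X @ beta
    w, _ = _risk_set_weights(eta, time)
    n_events = max(event.sum(), 1.0)
    # each event contributes x_i minus the weighted mean of x over its risk set
    return -(event[:, None] * (X - w @ X)).sum(axis=0) / n_events

def sgd_cox(X, time, event, batch_size=32, lr=0.1, epochs=20, seed=0):
    """Plain mini-batch SGD on the batch-restricted objective (illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            beta -= lr * batch_gradient(beta, X[idx], time[idx], event[idx])
    return beta
```

Because each update needs only the rows in the current batch, the same loop could in principle be fed from disk rather than from an in-memory array. `event` is expected as a float array of 0/1 indicators.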
The standard Cox model is known to behave poorly (the estimated coefficients may diverge to infinity) when the number of features exceeds the number of observations, or even when the number of observations is larger than but close to the number of features. One way to handle this issue is to use regularization to obtain well-behaved solutions. The standard approaches for fitting penalized Cox proportional hazards regression (e.g., with a lasso penalty) tend to fail for large-scale or ultra-high dimensional datasets because of computational instability and memory limits. In Chapter 3, we extend our proposed modification of the partial likelihood from Chapter 2 to the penalized partial likelihood to address these issues. In particular, our proposed framework enables data to be read off the hard drive in chunks to update the model sequentially, which makes it possible to fit the penalized Cox model on larger datasets. We apply stochastic proximal gradient descent (SPGD) in our framework to fit Cox regression models with a convex combination of $\ell_1$ (lasso) and $\ell_2$ (ridge regression) penalties, also known as the elastic-net penalty.
In Chapter 4 of the dissertation, we tackle a problem that differs somewhat from the survival analysis discussed above. In computational pathology, training deep neural networks on giga-pixel whole-slide images (WSIs) is a challenging task. The fundamental challenge is the absence of annotation at the patch (instance) level, because hand labeling is costly and time-consuming. This challenge is typically mitigated by pooling instances so that training relies only on slide-level labels, as in typical weakly supervised learning methods such as multiple-instance learning (MIL) and attention-based MIL. A WSI typically has hundreds of thousands of image patches, each of which may carry different information about the slide label/class.
Training a deep neural network with thousands of image patches per slide is computationally expensive. We propose an adaptive sampling strategy that aims to save computation by adaptively selecting a subset of highly predictive instances. We also study other sampling strategies that aim to reduce computation and compare them with our proposed strategy. Although we propose the adaptive sampling strategy for the classification problem, the idea is quite general and can readily be applied to survival prediction.
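As an illustration of the stochastic proximal gradient descent with an elastic-net penalty mentioned for Chapter 3, the sketch below shows the proximal step and a loop that updates the coefficients one data chunk at a time. It is a generic sketch under stated assumptions, not the dissertation's implementation: the penalty is parametrized as `lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)`, and `grad_fn` and `chunks` are hypothetical placeholders for a gradient of the smooth loss on one chunk and for any iterable that yields chunks (for example, blocks read from disk).

```python
import numpy as np

def prox_elastic_net(v, step, lam, alpha):
    """Proximal operator of step * lam * (alpha*|b|_1 + (1-alpha)/2 * |b|_2^2)."""
    # soft-thresholding handles the l1 part
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam * alpha, 0.0)
    # uniform rescaling handles the squared l2 part
    return shrunk / (1.0 + step * lam * (1.0 - alpha))

def spgd(chunks, grad_fn, p, lr=0.1, lam=0.1, alpha=0.5):
    """Stochastic proximal gradient descent over a stream of data chunks.

    chunks  : iterable yielding one chunk of data at a time (hypothetical)
    grad_fn : grad_fn(beta, chunk) -> gradient of the smooth loss on that chunk
    p       : number of coefficients
    """
    beta = np.zeros(p)
    for chunk in chunks:
        # gradient step on the smooth loss, then proximal step on the penalty
        beta = prox_elastic_net(beta - lr * grad_fn(beta, chunk), lr, lam, alpha)
    return beta
```

The proximal step sets small coefficients exactly to zero, which a plain gradient step on the penalized objective would not do. Only one chunk needs to be held in memory at a time.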