Tree-based ensemble methods for individualized treatment rules

ABSTRACT There is a growing interest in the development of statistical methods for personalized medicine or precision medicine, especially for deriving optimal individualized treatment rules (ITRs). An ITR recommends a treatment to a patient based on the patient's characteristics. The common parametric methods for deriving an optimal ITR, which model the clinical endpoint as a function of the patient's characteristics, can have suboptimal performance when the conditional mean model is misspecified. Recent methodological developments have cast the problem of deriving an optimal ITR as a weighted classification problem. Under this weighted classification framework, we develop a weighted random forests (W-RF) algorithm that derives an optimal ITR nonparametrically. In addition, with the W-RF algorithm, we propose variable importance measures for quantifying the relative relevance of the patient's characteristics to treatment selection, and an out-of-bag estimator for the population average outcome under the estimated optimal ITR. Our proposed methods are evaluated through extensive simulation studies. We illustrate the application of our methods using data from the Clinical Antipsychotic Trials of Intervention Effectiveness Alzheimer's Disease Study.


Introduction
With recent advancements in basic science and technology (e.g., gene expression profiling), precision medicine, or personalized medicine, has been advocated in the medical community and has influenced policymaking (National Research Council, 2011). In 2015, the U.S. government launched a $215 million Precision Medicine Initiative (Collins and Varmus, 2015). The basic concept of precision medicine is to make medical decisions, such as prescribing a particular drug or taking preventive health measures, based on individual patients' characteristics, including clinical information, demographics, genetics, environmental factors, etc.
A well-known example of personalized medicine is the preventive management of women who carry BRCA1 or BRCA2 mutations. Linkage studies and molecular biology confirmed that women with these mutations have a higher risk of developing breast and ovarian cancer (Ford et al., 1998;Venkitaraman, 2002). Researchers developed individualized preventive management, such as removal of the breasts or ovaries, for women carrying BRCA1 or BRCA2 mutations (Roukos and Briasoulis, 2007). In biomarker research, individual measurements that predict the risk of developing a disease, like the BRCA1 and BRCA2 mutations in this example, are called prognostic biomarkers. A prognostic biomarker is used to identify individuals with a higher risk of a certain disease.
Another type of biomarker, called a prescriptive or predictive biomarker, can inform whether patients will benefit from a treatment. Patients with different values of a predictive biomarker show differential treatment effects. A classic example comes from a clinical trial for women with primary operable breast cancer and positive axillary nodes from the National Surgical Adjuvant Breast and Bowel Project (NSABP). There were two treatment arms in the trial: L-phenylalanine mustard and 5-fluorouracil, with or without tamoxifen (RF or RFT). Researchers identified patients under 50 years old with a progesterone receptor level smaller than 10 femtomoles as a subgroup that would benefit from RF, while the other patients in the trial would benefit from RFT (Fisher et al., 1983). The patients' progesterone receptor level in this example is a predictive biomarker. An individualized treatment rule (ITR) can be developed based on predictive biomarkers and other predictive factors, including demographic, clinical or environmental information. In the previous example, an ITR can be described as follows: 1) if a patient is under 50 years old and has a progesterone receptor level smaller than 10 femtomoles, we recommend treating her with RF; 2) otherwise, we recommend treating her with RFT.
The above finding of NSABP is a result of subgroup analyses (Fisher et al., 1983).
In standard subgroup analyses, researchers pre-specify subgroups defined by patients' baseline covariates based on a biologically plausible mechanism. The results of the RCT are analyzed separately within the subgroups, and the coefficient of the interaction term between treatment and the pre-specified baseline covariates is tested (Pocock et al., 2002). The goal of standard subgroup analyses is to find a subgroup of patients for whom the treatment is particularly effective. The treatment is recommended for patients in a subgroup that shows a beneficial outcome in the treatment arm. Subgroup analyses can be viewed as an indirect way of deriving an ITR for the whole population represented by the RCT participants, and they are inefficient when multiple biomarkers or covariates are examined as potential predictive factors. Subgroup analyses are not the focus of this thesis.
The focus of this thesis is on statistical methods for deriving optimal individualized treatment rules based on data from randomized clinical trials (RCTs). An optimal ITR is defined as a rule that maximizes the overall population average outcome under the rule when the outcome is clinically desirable (such as CD4 counts in HIV patients), or that minimizes the overall population average outcome under the rule when the treatment aims to prevent the outcome (such as tumor size). There are two general approaches for deriving an optimal ITR.
The first general approach is to estimate the difference in average outcomes between the two treatment arms conditional on predictive biomarkers. This conditional treatment difference is a natural basis for deriving an ITR. To estimate the conditional treatment difference, we may use a generalized linear model with the clinical endpoint as the outcome and with the treatment, the biomarkers and the treatment-biomarker interaction terms as predictors. This parametric approach suffers from sensitivity to misspecification of the conditional mean model, especially when the biomarkers are high dimensional. When the conditional mean model is misspecified, the estimated conditional treatment difference is biased, and an ITR based on the biased estimates will be suboptimal. Several methods have been proposed to reduce the bias in estimating the conditional treatment difference caused by model misspecification (Qian and Murphy, 2011;Matsouaka et al., 2014;Kang et al., 2014).
The second general approach targets direct optimization of the population average outcome under an ITR. In 2012, three independent groups proposed a new framework for deriving an ITR that recasts the maximization of treatment benefit as a weighted classification problem (Zhang et al., 2012a;Rubin and van der Laan, 2012;Zhao et al., 2012). Several methods have been proposed under this framework (Zhao et al., 2012;Huang and Fong, 2014;Huang, 2015). In this thesis, we propose to use the techniques of random forests (Breiman, 2001) under this framework to solve the weighted classification problem in order to estimate an optimal ITR. The remainder of this thesis is organized as follows. In Chapter 2, we review the current research on deriving an optimal ITR. We extend the random forests method to solve the weighted classification and treatment selection problems in Chapter 3. In Chapter 4, we conduct simulation studies. In Chapter 5, we use data from the Clinical Antipsychotic Trials of Intervention Effectiveness Alzheimer's Disease Study (CATIE-AD) to demonstrate the application of our proposed method. The thesis concludes in Chapter 6 with a discussion of the topic.

Chapter 2
Review Of Current Research

Setting And Definition Of The Optimal ITR
We consider data collected from a two-arm randomized clinical trial. Let Y be the observed clinical outcome, which can be either continuous or binary. Without loss of generality, we assume Y is an undesired binary outcome (e.g., death), which the treatment intends to prevent. Hence, E(Y) is the event rate of the binary outcome in the population. Let X be a p-dimensional vector of baseline covariates, which can be either categorical or continuous. Let A be the treatment assignment, with A = 1 for the experimental treatment and A = 0 for the control or standard of care. By randomization, A is independent of X. We observe n independent and identically distributed copies (Y_i, X_i, A_i), i = 1, ..., n, from the RCT. Let g : X → {0, 1} be the individualized treatment rule (ITR) based on the covariates X, which maps from the covariate space to a treatment decision: to treat patients with covariates X if g(X) = 1 or not to treat if g(X) = 0.
It is helpful to describe the ITR under the potential outcome framework. Let Y(a) be the potential outcome under treatment a ∈ {0, 1}. Here, we assume the Stable Unit Treatment Value Assumption (SUTVA); i.e., there are no multiple versions of the same treatment, and one patient's outcome does not depend on any other patient's treatment assignment. Under SUTVA, the observed outcome is connected with the potential outcomes by Y = Y(1)A + Y(0)(1 − A). By randomization, we have (Y(0), Y(1)) ⊥ A. Hence, we have E(Y(a)|X) = E(Y|A = a, X); i.e., the potential event rate for treatment option A = a conditional on X can be estimated by the event rate among patients with baseline covariates X in the corresponding treatment arm. Let ∆(X) = E(Y(1)|X) − E(Y(0)|X) be the difference in the potential event rates between the two treatment options, conditional on X. For a subpopulation defined by baseline covariates X = x, to minimize the expected event rate, the treatment rule for this subpopulation is to treat them if ∆(x) < 0 and not to treat them if ∆(x) ≥ 0. As shown in the following paragraph, if we apply this treatment rule to every subpopulation, the expected event rate of the whole population is minimized.
To develop the ITR under the potential outcome framework, define the potential outcome under ITR g(X) as Y(g(X)) = Y(1)g(X) + Y(0)(1 − g(X)). Then E_{X,Y}[Y(g(X))] is the population average outcome if we follow the ITR. The optimal ITR based on X is defined as g^opt(X) = argmin_{g(X)} E[Y(g(X))]; i.e., an ITR is considered optimal if it minimizes the population event rate. We can rewrite E[Y(g(X))] as E[Y(g(X))] = E[Y(0)] + E[∆(X)g(X)], (2.1) which is minimized by g^opt(X) = 1(∆(X) < 0), where 1(·) is the indicator function (Zhang et al., 2012a).
In addition, we notice that this optimal ITR, defined by minimizing the population event rate under the ITR, has a nice interpretation in terms of average causal effects. g^opt(X) = 1(∆(X) < 0) also maximizes the following four quantities:
E[Y(1)] − E[Y(g(X))], (2.2a)
E[Y(0)] − E[Y(g(X))], (2.2b)
min{E[Y(1)], E[Y(0)]} − E[Y(g(X))], (2.2c)
E[Y(ḡ(X))] − E[Y(g(X))], (2.2d)
where ḡ(X) = 1 − g(X), so that Y(ḡ(X)) is the potential outcome if we defy g(X).
(2.2a) is the difference in average event rate if we follow g(X) compared with a rule that would treat all patients; (2.2b) is the difference in average event rate if we follow g(X) compared with a rule that would treat no patients; (2.2c) is the difference in average event rate if we follow g(X) compared with a rule that assigns all patients to the single treatment option that minimizes the whole-population event rate; (2.2d) is the difference in average event rate comparing g(X) with ḡ(X). Therefore, g^opt(X) = 1(∆(X) < 0) maximizes the benefit, measured as the difference in event rate, compared with these other possible rules.
Note that the above results extend naturally to a continuous outcome. If a lower clinical outcome Y is desired, the optimal ITR is still g^opt(X) = 1(∆(X) < 0). If a higher continuous outcome or a higher event rate of a binary outcome is desired, the optimal ITR is defined as g^opt(X) = argmax_{g(X)} E[Y(g(X))] and, following the same derivation, the optimal ITR is 1(∆(X) > 0). Because g^opt(X) = 1(∆(X) < 0) is the optimal rule, much research on predictive biomarkers and ITRs focuses on estimating ∆(X); this work is reviewed in the next section.
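As a toy numerical illustration of why g^opt(X) = 1(∆(X) < 0) minimizes the population event rate, consider a hypothetical population with two subgroups; all of the proportions and event rates below are made up for illustration:

```python
import numpy as np

# Hypothetical two-subgroup population (all numbers are made up).
prob = np.array([0.6, 0.4])     # subgroup proportions
p0 = np.array([0.30, 0.20])     # event rate without treatment, E[Y(0) | X = x]
p1 = np.array([0.15, 0.35])     # event rate with treatment,  E[Y(1) | X = x]
delta = p1 - p0                 # Delta(x): negative means treatment helps

def event_rate(g):
    """Population event rate E[Y(g(X))] for a rule g (one 0/1 decision per subgroup)."""
    return float(np.sum(prob * np.where(g == 1, p1, p0)))

g_opt = (delta < 0).astype(int)         # treat only where Delta(x) < 0
rates = {
    "treat all": event_rate(np.array([1, 1])),
    "treat none": event_rate(np.array([0, 0])),
    "g_opt": event_rate(g_opt),
}
# rates["g_opt"] is the smallest of the three.
```

Treating only the subgroup with ∆(x) < 0 gives a lower event rate than treating everyone or no one, matching quantities (2.2a) and (2.2b).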

Derive Optimal ITR Based On ∆(X)
A straightforward approach to estimating ∆(X) is to impose a parametric model for the outcome. For example, we can assume a generalized linear model (GLM) with mean model f(E(Y|A, X, β, γ)) = β_0 + β_1 A + Xγ_0 + AXγ_1, where f is the link function (e.g., the logit function is commonly used for a binary outcome), and β, γ are the regression parameters.
Under the proposed model, ∆(X) is estimated by ∆̂(X, β̂, γ̂) = f⁻¹(β̂_0 + β̂_1 + X(γ̂_0 + γ̂_1)) − f⁻¹(β̂_0 + Xγ̂_0). However, like all parametric approaches, this approach is prone to model misspecification, especially misspecification of the mean model. In fact, misspecification of the mean model can have a huge impact on the ITR estimated from the parametric model. An estimated ITR based on the proposed mean model, ĝ^opt(X, β̂, γ̂), is restricted to a class of decision rules (Zhang et al., 2012b). Because the inverse link function f⁻¹ is monotone, ĝ^opt(X, β̂, γ̂) takes the form 1[Xγ̂_1 + β̂_1 > 0] or 1[Xγ̂_1 + β̂_1 < 0]. When E(Y|A, X, β, γ) is correctly specified, ∆̂(X, β̂, γ̂) is a consistent estimate of ∆(X), and ĝ^opt(X, β̂, γ̂) is a good approximation of g^opt(X). However, when E(Y|A, X, β, γ) is not correctly specified and the boundary of the optimal ITR is not linear in (1, X), i.e., the optimal ITR is not within the class taking the form 1[Xγ_1 + β_1 > 0] or 1[Xγ_1 + β_1 < 0], ĝ^opt(X, β̂, γ̂) may not be a good approximation of the optimal ITR. For example, suppose the decision boundary of the optimal ITR based on two-dimensional covariates is a circle in the covariate space: patients with covariate values inside the circle benefit from the treatment, and patients with covariate values outside the circle do not. In this case, a GLM that includes the covariates as linear terms is misspecified; it forces the decision boundary to be a straight line, which can never be a good approximation of a circle, so an ITR based on this parametric model will be far from optimal. Scenario II in the simulation studies in Chapter 4 demonstrates this situation.
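The parametric approach above can be sketched as follows. The simulated data, the logistic working model and the use of scikit-learn are illustrative assumptions, not part of the methods proposed in this thesis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, size=n)                    # 1:1 randomized assignment
# Hypothetical true model with a treatment-covariate interaction.
logit = -0.5 + 0.8 * X[:, 0] + A * (0.5 - 1.2 * X[:, 0])
Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

def design(X, A):
    """Columns: A, X, A*X -- treatment, main effects, interactions."""
    return np.column_stack([A, X, A[:, None] * X])

glm = LogisticRegression().fit(design(X, A), Y)

def delta_hat(x):
    """Estimated Delta(x) = P(Y=1 | A=1, x) - P(Y=1 | A=0, x)."""
    x = np.atleast_2d(x)
    p1 = glm.predict_proba(design(x, np.ones(len(x))))[:, 1]
    p0 = glm.predict_proba(design(x, np.zeros(len(x))))[:, 1]
    return p1 - p0

g_hat = (delta_hat(X) < 0).astype(int)            # estimated ITR on the sample
```

Because the interaction here is linear, the fitted rule has a linear boundary; it is exactly this restriction that fails when the true boundary is, say, a circle.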
Several methods have been proposed to reduce the impact of misspecification of the mean model on the ITR. When X is a single covariate (a one-dimensional biomarker), one can estimate E(Y|X, A = 1) and E(Y|X, A = 0) via kernel smoothing, so that ∆̂(X) is estimated nonparametrically; the optimal ITR can then be approximated by 1[∆̂(X) < 0] (Matsouaka et al., 2014). In the case of a time-to-event outcome, Zhou and Ma (2012) and Ma and Zhou (2014) consider a varying-coefficient proportional hazards regression model, λ(t|X, A) = λ_0(t) exp{β(X)A}, where λ_0(t) is the baseline hazard function and β(X), the coefficient of the treatment indicator, is a function of X. β(X) is estimated nonparametrically by a local partial likelihood approach, and β(X) can be used to select treatment like ∆(X), because under the proportional hazards assumption 1[∆(X) < 0] = 1[β(X) < 0] and 1[∆(X) > 0] = 1[β(X) > 0]. In the case of a binary outcome, Han et al. (2015) consider a varying-coefficient model with a logit link, logit{µ(X, A)} = α(X) + β(X)A, where µ(X, A) = P(Y = 1|X, A). Here, β(X) is estimated using B-spline methods, and again β(X) can be used to select treatment like ∆(X). For example, if the outcome is desirable, such as treatment response, we should treat patients with β(X) > 0.
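A minimal sketch of the single-biomarker kernel-smoothing idea, using a Gaussian Nadaraya-Watson estimator within each arm; the kernel, bandwidth and simulated data are illustrative choices, not those of the cited papers:

```python
import numpy as np

def nw_mean(x_grid, x, y, h):
    """Nadaraya-Watson estimate of E(Y | X = x_grid) with a Gaussian kernel."""
    x_grid = np.asarray(x_grid, dtype=float)
    k = np.exp(-0.5 * ((x_grid[:, None] - x[None, :]) / h) ** 2)
    return (k * y).sum(axis=1) / k.sum(axis=1)

def delta_hat_kernel(x_grid, x, y, a, h=0.3):
    """Arm-wise smoothing: Delta_hat(x) = E_hat(Y|X=x, A=1) - E_hat(Y|X=x, A=0)."""
    return (nw_mean(x_grid, x[a == 1], y[a == 1], h)
            - nw_mean(x_grid, x[a == 0], y[a == 0], h))

# Illustrative data with true Delta(x) = x.
rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(-2.0, 2.0, size=n)
a = rng.integers(0, 2, size=n)
y = a * x + rng.normal(scale=0.3, size=n)
d = delta_hat_kernel(np.array([-1.0, 1.0]), x, y, a)
```

The estimated ∆̂ is negative at x = −1 and positive at x = 1, so the approximated rule 1[∆̂(x) < 0] treats only patients with negative biomarker values, as it should here.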
One advantage of studying a single biomarker is that we can plot ∆̂(X) against X, or against the percentile of X, with a confidence band around the curve, which is a useful way to evaluate a predictive biomarker graphically. Similarly, we can plot β̂(X) against X, which is termed the covariate-specific treatment effect (CTE) curve in the time-to-event setting (Ma and Zhou, 2014;Zhou and Ma, 2012) or the conditional average treatment effect (CATE) curve in the binary outcome setting (Han et al., 2015).
In the presence of multiple biomarkers, we want to combine information from the multiple biomarkers to guide treatment selection. Multivariate kernel smoothing of the conditional mean model E(Y|X, A) suffers from the "curse of dimensionality" (Matsouaka et al., 2014). Another general-purpose nonparametric regression method, random forests, has also been used to estimate E(Y|X, A) (Taylor et al., 2015). In Chapter 4, we use simulation studies to compare random forests that estimate E(Y|X, A) and derive the ITR with our proposed approach (developed in Chapter 3). Cai et al. (2011) proposed a two-step method to estimate a quantity similar to ∆(X), and Matsouaka et al. (2014) extended it to treatment selection. In the first step, similar to the approach discussed at the beginning of this section, a parametric model is used to create a univariate score S(X) = ∆̂(X, β̂, γ̂) for each observation. In the second step, a nonparametric method is used to estimate the average treatment difference conditional on the score, which is then used to select treatment. Note that the treatment difference conditional on S(X) is not the same as the treatment difference conditional on X, which means that this approach still depends on the specification of E[Y|X, A, β, γ], and the ITR derived from this approach can be suboptimal. Qian and Murphy (2011) considered a richer mean model by expanding the basis functions of X from the p-dimensional baseline covariate space to higher dimensions. For example, supposing that we have X = (X_1, X_2), we could expand the basis functions to X_expand = (X_1, X_2, X_1X_2, X_1², X_2²). The new model is: E(Y|A, X_expand, β, γ) = β_0 + β_1 A + X_expand γ_0 + AX_expand γ_1. A linear decision boundary in the expanded covariates can approximate a nonlinear decision boundary in the original space. However, as we expand the covariate space, the problem quickly becomes high dimensional.
Qian and Murphy (2011) proposed to use the L_1 penalized least squares (L_1-PLS) method, also known as the LASSO.
We can rewrite the mean model as E(Y|A, X_expand, β, γ) = E(Y|X*, β*) = X*β*, where X* = (1, A, X_expand, AX_expand) and β* = (β_0, β_1, γ_0, γ_1). The LASSO estimate of β* is β̂* = argmin_{β*} { n⁻¹ Σ_i (y_i − x_i*β*)² + λ Σ_j σ̂_j |β_j*| }, where σ̂_j = [n⁻¹ Σ_i (x_{ij}*)²]^{1/2} is a weight that balances the scales of the different basis functions, and λ is the LASSO penalty parameter, which can be selected by cross-validation. Then ∆̂(X) is obtained by plugging β̂* into the mean model, and the ITR can be constructed based on ∆̂(X). Kang et al. (2014) proposed an iterative boosting algorithm to reduce the bias and misclassification caused by misspecification of the working mean model. In iteration m, a working mean model E(Y|A, X, β, γ) is fitted with more weight given to observations whose ∆̂(X)^(m−1), estimated in the previous iteration, is near zero. The final ∆̂(X) is the average of the ∆̂(X)^(m) over all iterations. The authors argued that observations with ∆̂(X)^(m) near 0 are more likely to be given the wrong treatment recommendation and hence should receive more weight. This method has been criticized by discussants of the paper for its sensitivity to the working model specification and its lack of theoretical justification (Tian, 2014;Zhao and Kosorok, 2014). All of the above methods for treatment selection first estimate the conditional treatment difference, ∆(X), and then select treatment based on the estimate. These methods can be considered indirect compared with the weighted classification framework that we describe in the next section.
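The basis-expansion-plus-LASSO idea can be sketched as follows; scikit-learn's `Lasso` with a fixed penalty (rather than cross-validation) and the simulated outcome model are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))
A = rng.integers(0, 2, size=n)
# Hypothetical outcome with a quadratic treatment interaction: Delta(x) = 1 - x1^2.
Y = 1.0 + X[:, 0] + A * (1.0 - X[:, 0] ** 2) + rng.normal(scale=0.5, size=n)

expand = PolynomialFeatures(degree=2, include_bias=False)  # (X1, X2, X1^2, X1X2, X2^2)
Xe = expand.fit_transform(X)
Z = np.column_stack([Xe, A, A[:, None] * Xe])              # (X_expand, A, A*X_expand)
fit = Lasso(alpha=0.01).fit(Z, Y)   # fixed lambda for illustration; use CV in practice

def delta_hat(x):
    """Delta_hat(x) from the fitted expanded-basis model."""
    xe = expand.transform(np.atleast_2d(x))
    z1 = np.column_stack([xe, np.ones(len(xe)), xe])                  # A = 1
    z0 = np.column_stack([xe, np.zeros(len(xe)), np.zeros_like(xe)])  # A = 0
    return fit.predict(z1) - fit.predict(z0)
```

Here the linear model in the expanded basis recovers a nonlinear rule in the original space: ∆̂ is positive near the origin and negative for |x_1| large.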

Derive Optimal ITR Under Weighted Classification Framework
In this section, we review the weighted classification framework proposed by Rubin and van der Laan (2012) and Zhang et al. (2012a). Unlike the methods described in Section 2.2, the methods under this framework aim to directly optimize the population average outcome under the ITR, i.e., E[Y(g)], instead of modeling E(Y|X, A) and constructing the ITR based on ∆̂(X).
In Section 2.1, we defined the optimal ITR as g^opt(X) = argmin_{g(X)} E[Y(g(X))] when the treatment goal is to decrease a continuous outcome Y (such as blood pressure in patients with hypertension) or to reduce the probability of an undesired binary outcome (such as mortality), and as g^opt(X) = argmax_{g(X)} E[Y(g(X))] when the goal is to increase a continuous outcome Y (such as CD4 count in HIV-infected patients) or to increase the probability of a desired binary outcome (such as treatment response). Without loss of generality, we follow the derivation of Zhang et al. (2012a) to cast the optimization problem g^opt(X) = argmin_{g(X)} E[Y(g(X))] as a weighted classification problem for the situation where a lower Y is desired.
Recall from (2.1) that E[Y(g(X))] = E[Y(0)] + E[g(X)∆(X)]. Now, since 1(∆(X) < 0) and g(X) each take the value 0 or 1, we can rewrite g(X)∆(X) as g(X)∆(X) = |∆(X)|[1(∆(X) < 0) − g(X)]² − |∆(X)|1(∆(X) < 0). Because the second term does not involve g(X), minimizing E[Y(g(X))] is equivalent to minimizing E(|∆(X)|[1(∆(X) < 0) − g(X)]²), which can be viewed as an expected weighted misclassification error rate, where |∆(X)| is the weight, 1(∆(X) < 0) is the binary class label, and g(X) is the classifier.
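The algebraic identity underlying this weighted classification representation can be checked numerically; the simulated values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
delta = rng.normal(size=1000)          # arbitrary Delta(X) values
g = rng.integers(0, 2, size=1000)      # an arbitrary 0/1 treatment rule

z = (delta < 0).astype(float)          # class label 1(Delta(X) < 0)
w = np.abs(delta)                      # weight |Delta(X)|

lhs = g * delta
rhs = w * (z - g) ** 2 - w * z         # weighted error minus a term free of g
gap = np.max(np.abs(lhs - rhs))        # identically zero up to rounding
```

Since the subtracted term w·z does not depend on g, minimizing the average of w·(z − g)² over rules g is the same as minimizing the average of g·∆.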
In addition to the abstract mathematical derivation above, the minimization of the weighted misclassification error rate has an intuitive interpretation. The class label is the optimal treatment rule, g^opt(X) = 1(∆(X) < 0), as shown above. We would like an ITR g(X) that matches g^opt(X); hence, we would like to minimize the misclassification error rate. If a misclassification error occurs, should every misclassification error receive the same weight? The weighted classification framework says that the weight of a misclassification should be |∆(X)|. Suppose a subpopulation defined by covariates x* is misclassified by g(x*); i.e., g(x*) = 1 − g^opt(x*) ≠ g^opt(x*). This misclassification increases E[Y(g(X))], the average outcome under the ITR for the whole population, in proportion to |∆(x*)|, which is the absolute difference between the average outcome under the optimal rule and the average outcome defying the optimal rule in this subpopulation. Hence, if our goal is to minimize E[Y(g(X))] with respect to g(X), we should weight each misclassification error by |∆(X)|.
To solve this minimization problem with the observed data, we approximate the objective function, an expectation, by its empirical counterpart, the sample average n⁻¹ Σ_{i=1}^n |∆(X_i)|[1(∆(X_i) < 0) − g(X_i)]². In addition, we need to estimate the unobservable ∆(X_i) from the sample. Zhang et al. (2012a) proposed two consistent estimators of ∆(X_i) for individual i, namely the inverse probability weighted estimator (IPWE) and the augmented inverse probability weighted estimator (AIPWE). We describe these estimators in detail below.
The IPWE of ∆(X_i) is given by ∆̂(X_i)_IPWE = A_iY_i/π_i − (1 − A_i)Y_i/(1 − π_i), where π_i = P(A_i = 1) is the probability that individual i is assigned to the treatment group. In randomized clinical trials, π_i, also known as the propensity score, is known by the study design, and ∆̂(X_i)_IPWE is a consistent estimator. In observational studies, extra effort is needed to estimate π_i based on subjects' covariates.
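A sketch of the IPWE computation; the simulated potential outcomes and the function name are illustrative:

```python
import numpy as np

def delta_ipwe(y, a, pi):
    """Per-subject IPWE contrast: A*Y/pi - (1-A)*Y/(1-pi)."""
    y, a, pi = np.asarray(y), np.asarray(a), np.asarray(pi)
    return a * y / pi - (1 - a) * y / (1 - pi)

# Illustrative check: with known potential outcomes, the IPWE values average
# to the true treatment contrast (here 0.2 - 0.3 = -0.1).
rng = np.random.default_rng(3)
n = 100_000
pi = 0.5                                   # 1:1 randomization
a = rng.binomial(1, pi, size=n)
y1 = rng.binomial(1, 0.2, size=n)          # Y(1), event rate 0.2
y0 = rng.binomial(1, 0.3, size=n)          # Y(0), event rate 0.3
y = a * y1 + (1 - a) * y0                  # observed outcome (SUTVA)
d = delta_ipwe(y, a, pi)
```

Individual IPWE values are very noisy (they take values 0 or ±Y/π), which is what motivates the augmented version described next.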
The AIPWE of ∆(X_i) is given by ∆̂(X_i)_AIPWE = A_i(Y_i − µ̂(X_i, 1))/π_i − (1 − A_i)(Y_i − µ̂(X_i, 0))/(1 − π_i) + µ̂(X_i, 1) − µ̂(X_i, 0), where µ̂(X, a) is an estimate of E(Y|X, A = a) based on a working model for the outcome Y. Although the working model for the outcome could be misspecified, ∆̂(X_i)_AIPWE remains a consistent estimator because π_i is known in randomized clinical trials. In addition, compared with the IPWE, the AIPWE gains precision by utilizing information from the baseline covariates.
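A corresponding sketch of the AIPWE, here with a correctly specified working model, compared against the IPWE; the function names and the simulation are illustrative:

```python
import numpy as np

def delta_aipwe(y, a, pi, mu1, mu0):
    """Per-subject AIPWE contrast: the IPWE applied to residuals from a
    working outcome model mu_a(X), plus the model-based contrast."""
    y, a = np.asarray(y), np.asarray(a)
    return (a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1 - pi)
            + (mu1 - mu0))

rng = np.random.default_rng(5)
n = 50_000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * x + 0.5 * a + rng.normal(scale=0.5, size=n)   # true Delta = 0.5
mu1 = 1.0 + 2.0 * x + 0.5       # correctly specified working model, A = 1
mu0 = 1.0 + 2.0 * x             # A = 0
d_aipwe = delta_aipwe(y, a, 0.5, mu1, mu0)
d_ipwe = delta_aipwe(y, a, 0.5, 0.0, 0.0)    # mu = 0 recovers the plain IPWE
# Both average to ~0.5; the AIPWE values are far less variable.
```

Setting µ̂ ≡ 0 collapses the augmentation terms and returns the IPWE, which makes the precision gain from a good working model easy to see directly.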
In summary, under this framework, obtaining an optimal ITR from a randomized clinical trial can be cast as the weighted classification problem ĝ^opt(X) = argmin_{g(X)} n⁻¹ Σ_{i=1}^n |∆̂(X_i)|[1(∆̂(X_i) < 0) − g(X_i)]². (2.5) Interestingly, Zhao et al. (2012) used a different derivation to reach the same result, i.e., casting the optimization of E[Y(g(X))] as the minimization of the sample-averaged weighted misclassification error, with ∆̂ estimated by the IPWE (Zhang et al., 2012a). The authors call their method 'outcome weighted learning' (OWL).
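The empirical objective can be evaluated for any candidate rule; a minimal sketch with hypothetical estimated contrasts:

```python
import numpy as np

def weighted_error(g, delta_hat):
    """Empirical objective in (2.5): average of |Delta_hat_i| over subjects
    whose recommended treatment g(X_i) disagrees with the class label."""
    z = (delta_hat < 0).astype(int)       # estimated class label
    w = np.abs(delta_hat)                 # misclassification weight
    return float(np.mean(w * (z - g) ** 2))

# Hypothetical estimated contrasts for five subjects.
d = np.array([-0.4, 0.1, -0.05, 0.3, -0.2])
treat_all = np.ones(5, dtype=int)
treat_none = np.zeros(5, dtype=int)
optimal = (d < 0).astype(int)
# weighted_error(optimal, d) == 0; other rules pay |Delta_hat| per mistake.
```

Any classifier, from a linear rule to a forest, can be scored this way, which is what lets the off-the-shelf classification machinery below be reused.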
The advantage of this weighted classification framework is that the ITR, g(X), is no longer restricted by any assumption on the outcome model, E[Y|X, β, γ, A], as are the approaches discussed in Section 2.2. We can choose the class of g(X) to be a linear or nonlinear combination of all of X or of a subset of X, or place no restriction on the form of g(X). Hence, this framework provides a direct way of deriving the ITR that is more flexible and robust than the methods discussed in Section 2.2.
After casting the goal of deriving an optimal ITR as a weighted classification problem, various well-developed classification techniques can be applied. Some researchers have tackled the problem using optimization algorithms closely related to the support vector machine (SVM). In the language of optimization, (2.5) is a weighted sum of 0-1 losses, which is difficult to minimize. We can restrict g(X) to linear combinations of X, Xγ, and replace the 0-1 loss by a convex and continuous surrogate loss function, such as the hinge loss (Zhao et al., 2012) or the exponential loss (Huang, 2015), or by a difference of two convex loss functions, such as the ramp loss (Huang and Fong, 2014) or the smooth ramp loss (Huang, 2015). Various convex optimization procedures can then be deployed to minimize the objective function. However, restricting g(X) to linear combinations of X restricts the decision boundary of the ITR to a linear boundary. For more flexible decision boundaries, Zhao et al. (2012), Huang and Fong (2014) and Huang (2015) replaced the linear kernel with nonlinear kernels, such as the radial basis function (RBF) kernel, which enables the algorithm to identify a nonlinear decision boundary for the ITR.
Another family of classification methods is tree-based methods. Zhang et al. (2012a) used classification and regression trees (CART) to minimize the objective function in (2.5). Tree-based models are easy to interpret, but they restrict the decision boundary to a certain class, and CART is known to be unstable and to lack smoothness (Hastie et al., 2009). Rubin and van der Laan (2012) used bagging trees to solve the minimization problem, which alleviates these limitations of CART. Another popular classification and regression method is random forests (Breiman, 2001), which has been demonstrated to have greater predictive power than bagging trees and SVM in many settings. In this thesis, extending the bagging trees method, we use random forests to minimize the objective function in (2.5). Methods like SVM with an RBF kernel, bagging trees and random forests are sometimes called 'black box' models, which are difficult to interpret and provide little clinical insight into the different biomarkers. Hence, motivated by the lack of interpretability of random forests, we also discuss variable importance measures in the context of ITRs. Variable importance measures are also helpful for selecting predictive biomarkers from a larger number of candidate biomarkers to ensure the stability of the derived ITRs. Under the weighted classification framework, Huang (2015) proposed an algorithm to select predictive biomarkers when the decision boundary is restricted to linear combinations of biomarkers. Our proposed variable importance measures quantify the relative relevance of the patient's characteristics to treatment selection without restricting the form of the combination of biomarkers.

Random Forests Under The Weighted Classification Framework
Breiman and colleagues popularized tree-based models in their monograph, Classification and Regression Trees (CART) (Breiman et al., 1984). CART is a general model for regression and classification. Building on CART, Breiman proposed two methods called bagging (bootstrap aggregation) trees (Breiman, 1996) and random forests (Breiman, 2001). Random forests have become one of the most popular off-the-shelf methods for classification and regression (Hastie et al., 2009). The excellent performance of random forests comes at the cost of interpretability. Breiman et al. proposed two types of variable importance measures for tree-based models and random forests, which help to give insight into the relative importance of each covariate for the outcome (Breiman et al., 1984;Breiman, 2001). Here, we apply the idea of random forests to solve the weighted classification problem stated in (2.5). In addition, we develop and compare two types of variable importance measures in the context of deriving the optimal ITR. We first describe a single tree classifier, the building block of random forests, in the context of deriving the optimal ITR.

A Single Decision Tree
Recall that under the weighted classification framework discussed in the previous chapter, the optimal ITR solves the weighted classification problem (2.5). Let Z_i denote 1(∆̂(X_i) < 0), the estimated binary class label of subject i, and let W_i denote |∆̂(X_i)|, the estimated weight for a misclassification error on subject i. Recall that ∆̂(X_i) can be estimated by the IPWE or the AIPWE. Assume that subjects' baseline covariates X ∈ R^p. The problem is to find a classifier g(X) : R^p → {0, 1} such that the weighted misclassification error is minimized.
A tree-based classifier T(X) can be written as T(X) = Σ_k c_k 1(X ∈ R_k), where the R_k are mutually exclusive and collectively exhaustive regions that partition the covariate space R^p. Here, c_k ∈ {0, 1} is the predicted class label for observations in region R_k. In other words, patients in the same region defined by their covariates X are recommended the same treatment. A weighted version of the CART method is used to construct the R_k and c_k from the covariates X, the binary labels Z and the weights W through a recursive binary splitting algorithm, as described in the following paragraphs.
At the start, all observations (x_i, w_i, z_i), i = 1, ..., n, are in the root node, N_0. In tree terminology, a node is also a region of the covariate space; the root node is the entire covariate space, R^p. We define the weighted misclassification error of the root node as Q_0 = min_{c ∈ {0,1}} Σ_{i=1}^n w_i 1(z_i ≠ c). In decision tree terminology, Q_0 is called the impurity measure (of the root node). For tree-based models, the impurity measure is a quantity that measures the heterogeneity within each node and guides the construction of the tree. The recursive binary splitting algorithm aims to minimize the impurity measure.
We can split the root node N_0 into two child nodes N_1 and N_2 using the j-th covariate and a cutoff point s: N_1^(j,s) = {X : X_j ≤ s} and N_2^(j,s) = {X : X_j > s}, with weighted misclassification errors Q_1^(j,s) and Q_2^(j,s), respectively. To reduce the weighted misclassification error, we choose (j, s) to maximize [Q_0 − Q_1^(j,s) − Q_2^(j,s)]; that is, we choose (j, s) to achieve the maximum decrease in the weighted misclassification error. Computationally, for each covariate we consider all cutoff points and find the best one, i.e., the cutoff that maximizes the decrease in the weighted misclassification error; then, from the p candidates, we choose the covariate, with its best cutoff point, that maximizes the decrease. Henceforth, we drop the superscript (j, s) from the region and impurity notation. Let δ_{l=1}(Q)_j = Q_0 − Q_1 − Q_2 denote the decrease in the weighted misclassification error at the first split (l = 1) when the split is on the j-th covariate.
Set δ_l(Q)_j = 0 if the l-th split is not on X_j. This quantity will be used to construct one of the variable importance measures, described in Section 3.4. After the first split, N_0 becomes an internal node, and N_1 and N_2 become terminal nodes, which are subject to the next split. The splitting process is repeated recursively on the terminal nodes and produces a tree-structured model. Ultimately, we obtain a large tree in which each terminal node contains only one observation. If we used these terminal nodes as the regions of a tree model, we would likely overfit. The recommended practice to avoid overfitting in a tree model is to grow a large tree first and then use a pruning method to collapse terminal nodes, which results in a smaller and less complex tree (Hastie et al., 2009).
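The exhaustive search over (j, s) described above can be sketched as follows; this is a simplified illustration, not the thesis' implementation:

```python
import numpy as np

def best_split(x, z, w):
    """Find the (j, s) maximizing the decrease in weighted misclassification error."""
    def node_error(idx):
        # Impurity of a node: weighted error under the better label c in {0, 1}.
        if idx.size == 0:
            return 0.0
        err_if_0 = w[idx][z[idx] == 1].sum()    # cost of predicting c = 0
        err_if_1 = w[idx][z[idx] == 0].sum()    # cost of predicting c = 1
        return min(err_if_0, err_if_1)

    n, p = x.shape
    q0 = node_error(np.arange(n))               # root impurity Q_0
    best_j, best_s, best_dec = None, None, 0.0
    for j in range(p):
        for s in np.unique(x[:, j])[:-1]:       # candidate cutoff points
            left = np.where(x[:, j] <= s)[0]
            right = np.where(x[:, j] > s)[0]
            dec = q0 - node_error(left) - node_error(right)
            if dec > best_dec:
                best_j, best_s, best_dec = j, float(s), float(dec)
    return best_j, best_s, best_dec
```

Applying the same search recursively to each resulting child node yields the full weighted tree.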
However, overfitting of individual trees is not a problem in random forests (Breiman, 2001).
In a random forest, we grow a complex tree without pruning; therefore, we skip the details of pruning. The final terminal nodes are the regions R_k, and c_k is the label that minimizes the weighted misclassification error in each region. If there is only one observation in a region, c_k is the observed class label.
Suppose some of the covariates are unordered categorical variables. For a categorical variable with finitely many levels, there are finitely many ways to split the sample into two nodes. The splitting procedure for a categorical variable finds the subset of levels that gives the maximum decrease in the weighted misclassification error.
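A sketch of the categorical split search by subset enumeration; this is illustrative only, and for variables with many levels the enumeration becomes expensive:

```python
import numpy as np
from itertools import combinations

def best_categorical_split(levels_col, z, w):
    """Enumerate subsets of levels; return the best left-node subset and the
    resulting decrease in weighted misclassification error."""
    def err(mask):
        if not mask.any():
            return 0.0
        return min(w[mask & (z == 1)].sum(), w[mask & (z == 0)].sum())

    levels = np.unique(levels_col)
    root = err(np.ones(len(z), dtype=bool))
    best_subset, best_dec = None, 0.0
    for r in range(1, len(levels)):               # nonempty proper subsets
        for subset in combinations(levels, r):
            left = np.isin(levels_col, subset)
            dec = root - err(left) - err(~left)
            if dec > best_dec:
                best_subset, best_dec = set(subset), float(dec)
    return best_subset, best_dec
```

Each nonempty proper subset of levels defines the left node; its complement defines the right node.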
A single tree model has several drawbacks. Because of binary splitting, the decision boundary is restricted to the boundaries of unions of rectangular regions. Because of the hierarchical structure of the tree, earlier splits affect later splits, which makes the tree model unstable: a few outliers can influence an early split, propagate to later splits and change the whole structure of the tree (Hastie et al., 2009). In addition, pruning a tree requires extra effort to choose a tuning parameter. These drawbacks lead to moderate classification performance for a single tree. In the next section, we describe the random forests method for deriving the optimal ITR under the weighted classification framework, which overcomes these drawbacks of a single tree model.

Weighted Random Forests For Estimating ITR
The method of random forests is an extension of bagging trees. 'Bootstrap aggregation' (bagging) is a method to reduce the instability of a predictive model, such as CART (Breiman, 1996). We first describe bagging trees.
We have a training dataset D = {(x_1, y_1, a_1), ..., (x_n, y_n, a_n)} of size n from a clinical trial. The first step is to bootstrap the training data D B times, which gives bootstrapped samples D_b, b = 1, ..., B. For each D_b, the binary label z_i and the weight w_i are estimated by IPWE or AIPWE. Then we use CART to build a single tree model without pruning, T_b, based on each bootstrapped sample as described in Section 3.1. The final model is a committee of the T_b. For new data with covariates x, the predicted value, a binary classification, from the bagging trees is the majority vote of the committee of trees, denoted {T_b(x)}_{b=1}^B. By bootstrapping the sample and aggregating results through majority voting over the committee of trees, the stability and prediction accuracy of the model are improved. Overfitting is also not an issue, as the aggregated prediction is not made from a single tree (Breiman, 1996). In addition, the decision boundary is not limited to boundaries of unions of rectangular regions, as it would be for a single tree.
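The bootstrap-and-vote structure above can be sketched as follows. This is a minimal Python sketch (the thesis' implementation is in R); `fit_tree` is a placeholder standing in for the weighted CART step of Section 3.1, and each fitted tree T_b is represented abstractly as a prediction function.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a bootstrapped sample D_b of the same size as the training data."""
    return [rng.choice(data) for _ in data]

def bagging_fit(data, B, fit_tree, rng):
    """Fit a committee of trees, one per bootstrapped sample D_b."""
    return [fit_tree(bootstrap_sample(data, rng)) for _ in range(B)]

def bagging_predict(trees, x):
    """Majority vote of the committee of trees {T_b(x)}."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

In W-RF, the labels z_i and weights w_i would be re-estimated inside `fit_tree` from each bootstrapped sample before the tree is grown.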
Since every individual tree is built on a bootstrapped sample of the same training data, the trees are correlated. Breiman (2001) proposed random forests, which randomly select a subset of covariates as candidates for each split. This random selection of covariates reduces the correlation between trees. As with bagging trees, for new data with covariates x, the predicted value is the majority vote of the committee of trees, denoted RF(x) = {T_b(x)}_{b=1}^B. We propose to use this tree-based ensemble algorithm, with individual weights estimated from each bootstrapped sample and random selection of candidate covariates at each split, to estimate the ITR. We term the algorithm Weighted Random Forests (W-RF).
In general, random forests perform better than bagging trees because of the smaller correlation between individual trees. Hastie et al. (2009) suggested treating the number of covariates randomly sampled as candidates at each split, often denoted mtry, as a tuning parameter. Notice that, in our algorithm, the individual weights and labels are estimated using each bootstrapped sample instead of the whole training data; hence, the weight or label for an individual subject can differ slightly across trees, which further reduces the correlation between trees. In our numerical experiments, this strategy moderately improved performance compared to estimating individual weights and labels from the whole sample before the bootstrap procedure.

Evaluation Of The Derived Rule
In the previous sections, we proposed the weighted random forests (W-RF) method to derive an optimal ITR from clinical trial data. Before the derived ITR is used in practice, we need to evaluate its performance and compare it to ITRs derived from other methods. There are two performance measures for the evaluation of derived ITRs. The first is the misclassification error rate (MCR) of an ITR, Pr(g^opt(X) ≠ ĝ^opt(X)). MCR can be estimated by the empirical proportion of disagreement between the optimal ITR and the derived ITR only in simulation studies, where g^opt(X) is known; hence, it cannot be used as a performance measure in real data analyses. The second is the population average outcome under the derived ITR, E[Y(ĝ^opt)], as the optimal ITR is defined as a rule optimizing E[Y(g)]. In the following paragraphs, we discuss possible ways to estimate the population average outcome under the derived rule.
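In a simulation, the MCR of a derived rule is simply the empirical proportion of disagreement with the known optimal rule; a minimal Python sketch, with both rules represented as functions of the covariates:

```python
def misclassification_rate(g_opt, g_hat, X):
    """Empirical Pr(g_opt(X) != g_hat(X)) over a testing sample X."""
    return sum(g_opt(x) != g_hat(x) for x in X) / len(X)
```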
A naive approach is to estimate the population average outcome under the rule using the same data used to derive the rule; this is often termed the 'naive estimate'.
In essence, deriving an optimal ITR is a prediction problem: predicting whether a patient with specific characteristics would benefit from a treatment. In all prediction problems, statistical models derived from optimization procedures on training data, including maximum likelihood estimation for the conditional mean model and direct optimization procedures such as our method, risk overfitting. Because E[Y(g)] is also the quantity we use to evaluate the ITR, the direct optimization approach can be even more likely to overfit the training data than indirect methods that depend on ∆(X). For future patients with covariates X̃, ĝ^opt(X̃) might be far from maximizing E[Y(g(X))]. Hence, in the existing literature, to evaluate the performance of an estimated ITR ĝ(X) in simulation studies, a large testing dataset is typically generated to estimate Ê[Y(ĝ(X))] (Huang and Fong, 2014; Huang, 2015; Matsouaka et al., 2014; Zhao et al., 2012). Suppose we derived a treatment rule ĝ(.) from the training data; the empirical estimate of E[Y(ĝ(.))] using the testing data (X̃_i, Ã_i, Ỹ_i), i = 1, ..., n*, is

Ê[Y(ĝ)] = [ Σ_{i=1}^{n*} Ỹ_i 1(Ã_i = ĝ(X̃_i)) / π(Ã_i | X̃_i) ] / [ Σ_{i=1}^{n*} 1(Ã_i = ĝ(X̃_i)) / π(Ã_i | X̃_i) ].   (3.1)

However, in analyses of real data, a large testing dataset is rarely available. The most common method adopted in predictive modeling practice is K-fold cross-validation. The original dataset is randomly split into K equal-sized folds; K − 1 folds are used as a training dataset to derive the ITR, and the remaining fold is used as a testing dataset to estimate E[Y(ĝ(X))]; the procedure is repeated K times, so that every fold of the original dataset is used once as a testing dataset; the average of the K estimates Ê[Y(ĝ(X))] is then calculated. One can further repeat the random split multiple times and take the average of the estimates.
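The weighted estimator (3.1) can be sketched as follows. This is a minimal Python sketch assuming a known treatment probability pi(a | x), as in a randomized trial; `value_estimate` is a hypothetical helper name.

```python
def value_estimate(data, g_hat, pi):
    """Estimate E[Y(g_hat)] per (3.1): average the outcomes of subjects whose
    assigned treatment agrees with the rule, inverse-weighted by pi(a | x)."""
    num = sum(y * (a == g_hat(x)) / pi(a, x) for x, a, y in data)
    den = sum((a == g_hat(x)) / pi(a, x) for x, a, y in data)
    return num / den
```

With equal randomization (pi = 0.5 for both arms), the estimator reduces to the mean outcome among subjects whose observed treatment matches the rule.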
Because only (K − 1)/K of the original dataset is used to build the model, the model performance is underestimated, especially when K is small; on the other hand, a large K, as in leave-one-out cross-validation, can have large variance and be computationally intractable. Hence, 5-fold or 10-fold cross-validation is recommended (Hastie et al., 2009). Five- or 10-fold cross-validation has been used to evaluate a derived ITR in data analysis examples in the existing literature (Huang and Fong, 2014; Huang, 2015; Matsouaka et al., 2014; Zhao et al., 2012). For our proposed algorithm, we consider an alternative and novel method for estimating the population average outcome under the derived ITR, described in the following paragraph.
In the classic classification context, out-of-bag error estimates from random forests have been used in place of K-fold cross-validation (Breiman, 2001; Hastie et al., 2009). Here, we introduce the out-of-bag (OOB) estimate. For each tree T_b, the observations not selected in the bootstrapped sample D_b form the OOB sample D_b^oob, and T_b(D_b^oob) is an OOB prediction for T_b. We aggregate the T_b(D_b^oob) and use the majority vote rule to construct the OOB estimated ITR of the original training dataset for the random forests model, denoted RF^oob(X). Similar to the problem of prediction error estimation in the standard classification context, we can use ĝ(X)^oob = RF^oob(X) and the original training dataset (X, A, Y) to compute Ê[Y(ĝ(X))]^oob using (3.1) to evaluate the performance of ĝ(X). In Chapter 4, we numerically assess the quality of Ê[Y(ĝ(X))]^oob by comparing it to the estimate from a large testing set, the 5-fold CV estimate, and the naive estimate in simulation studies.
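Constructing RF^oob(X) can be sketched as follows: for subject i, only trees whose bootstrapped sample did not contain i cast a vote. This is a hedged Python sketch in which each tree is represented abstractly as a (predict, in-bag index set) pair.

```python
from collections import Counter

def oob_rule(trees, n):
    """OOB estimated ITR: for each training subject i, take the majority
    vote over the trees for which subject i was out of bag."""
    rule = []
    for i in range(n):
        votes = Counter(predict(i) for predict, inbag in trees
                        if i not in inbag)
        # With a large B every subject is OOB for some trees; None marks
        # the rare case where no tree can vote.
        rule.append(votes.most_common(1)[0][0] if votes else None)
    return rule
```

The resulting rule, applied to the original training data, is then plugged into (3.1) to obtain the OOB value estimate.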

Variable Importance
Although random forests have better prediction performance than a single decision tree, the aggregation of trees loses the simple interpretation. The variable importance measures in random forests are helpful in exploratory analyses for ranking the covariates according to their predictiveness (Hastie et al., 2009). Here, we extend the concept of variable importance measures to the context of deriving optimal ITRs. Variable importance measures can help us select, from multiple candidate markers, the predictive markers associated with the treatment decision in an exploratory analysis. We propose two variable importance measures for the p covariates, described in the following paragraphs.
The first type of variable importance measure for covariate X_j, denoted VI_1(X_j), is the sum of decreases in weighted misclassification error over all splits attributed to X_j, averaged over all trees. To elaborate, recall the decrease in weighted misclassification error δ_l(Q)_j described in Section 3.1. Suppose there are L splits in tree b; the first type of variable importance measure of X_j in tree b is VI_1(X_j)_b = Σ_{l=1}^L δ_l(Q)_j. Averaging over the B trees, we have VI_1(X_j) = (1/B) Σ_{b=1}^B VI_1(X_j)_b. A higher value of VI_1(X_j) suggests that X_j is more relevant to minimizing the weighted misclassification error in (2.5) and hence more relevant to treatment selection.
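The VI_1 aggregation can be sketched as: given, for each tree, the per-split decreases δ_l(Q)_j attributed to X_j, sum within trees and average across trees. A minimal Python sketch:

```python
def vi1(split_decreases_by_tree):
    """VI_1(X_j) = (1/B) * sum_b sum_l delta_l(Q)_j, where the input is a
    list (over the B trees) of lists of per-split decreases in weighted
    misclassification error attributed to X_j (empty if X_j is unused)."""
    B = len(split_decreases_by_tree)
    return sum(sum(deltas) for deltas in split_decreases_by_tree) / B
```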
The second type of variable importance measure is related to the out-of-bag (OOB) estimation of E[Y(ĝ(X))] described in Section 3.3. For the b-th OOB sample, D_b^oob, we can permute the j-th covariate, x_j, which gives D_b^oob(j). We repeat this permutation procedure for all OOB samples, and an OOB prediction for the random forest with x_j permuted, denoted RF^oob(x_(j)), is obtained by majority vote over all OOB samples.
The second type of variable importance measure, VI_2(X_j), is defined as Ê[Y(ĝ(X_(j)))]^oob, the OOB estimate of E[Y(ĝ)] with X_j permuted. If we want to minimize E[Y(g)], permuting a variable relevant to treatment selection will increase E[Y(g)]; hence, a higher VI_2(X_j) suggests greater importance of X_j in minimizing E[Y(g)]. Similarly, when the goal is to maximize E[Y(g)], a lower VI_2(X_j) suggests greater importance of X_j in maximizing E[Y(g)].
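The permutation step behind VI_2 can be sketched as follows; VI_2(X_j) is then the OOB value estimate recomputed on the permuted data. This is a hedged Python sketch over (x, a, y) triples, not the thesis' R code.

```python
import random

def permute_covariate(data, j, rng):
    """Return a copy of (x, a, y) triples with the j-th covariate permuted
    across subjects, breaking its association with the treatment rule while
    leaving all other covariates, treatments, and outcomes intact."""
    col = [x[j] for x, a, y in data]
    rng.shuffle(col)
    out = []
    for (x, a, y), v in zip(data, col):
        x = list(x)
        x[j] = v
        out.append((tuple(x), a, y))
    return out
```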

Chapter 4 Simulation Study
In this chapter, we conduct simulation studies to numerically evaluate the performance of the proposed methods.
Here, C(X) represents a complicated generative model, which is hard to correctly specify.
For Scenario I, we have δ(X) = 3(X_1 + X_2), a linear optimal decision boundary. For Scenario II, we have δ(X) = 9 − 4X_1^2 − 4X_2^2, an optimal decision boundary that is a circle. For Scenario III, we have δ(X) = 5[2·1(X_1 > −0.5 & X_2 > −0.5) − 1], a decision boundary that can be represented as an 'and'/'or' rule based on X. Figure 4.1 shows the optimal decision boundaries and the best linear boundaries approximating them, overlaid with 5000 simulated observations from these three scenarios. The predictive biomarkers (X_1, X_2) are generated from a bivariate normal distribution with mean 0, variance 1, and correlation 0.2. The other covariates X_j, j = 3, ..., p, are generated independently from a standard normal distribution. For each subsetting of every scenario, 500 Monte Carlo datasets are simulated.
We compare W-RF with the following existing methods:
• The L_1-PLS method (PLS) (Qian and Murphy, 2011)
• Random forests that directly estimate E(Y|X, A) and select treatment based on ∆̂(X) (Direct-RF)
• The robust kernel methods proposed by Huang and Fong (2014) (HF)
• Weighted logistic regression with L1-norm penalty (Huang, 2015) (W-logistic-l1)
PLS and Direct-RF first estimate E(Y|X, A) and then select treatment based on ∆̂(X); i.e., ĝ(X) = 1(∆̂(X) > 0). In this simulation, PLS uses (1, X, A, XA) as the basis function set, and the LASSO procedure is implemented with the tuning parameter selected by the cv.glmnet function with default settings from the glmnet R package. Hence, the decision boundary of PLS is linear in X. In this simulation, Direct-RF uses (X, A) as input covariates.
It is implemented by the randomForest function from the randomForest R package with default settings, except that the number of covariates randomly sampled as candidates at each split, mtry, is selected among ([(p + 1)/3], [2(p + 1)/3], p + 1) based on OOB estimates of the mean squared error of Ŷ, where p is the dimension of X and [.] rounds a real number to the nearest integer.
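The mtry candidate grid described above can be written out explicitly; `mtry_candidates` is a hypothetical helper name for illustration, not part of the randomForest package.

```python
def mtry_candidates(p):
    """Candidate values ([(p + 1)/3], [2(p + 1)/3], p + 1) for the number of
    covariates sampled at each split, with [.] rounding to nearest integer."""
    return [round((p + 1) / 3), round(2 * (p + 1) / 3), p + 1]
```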
HF is one of the existing methods under the weighted classification framework as we mentioned in Chapter 3. HF is implemented by the R code which can be found in the supplemental material of Huang and Fong (2014). When a linear combination of X is considered (HF-linear), there is one tuning parameter, λ, and when the non-linear kernel (the radial basis function) is considered (HF-RBF), there are two tuning parameters, λ and γ. The tuning parameters are selected by 5-fold cross-validation.
In addition to using HF-linear, HF-RBF and W-RF to solve the weighted classification problem with potentially many irrelevant covariates, we also included weighted logistic regression with L1-norm penalty (W − logistic − l 1 ) in the context of deriving ITRs (Huang, 2015). It is implemented with (1, X) as the design matrix, estimated individual weights and the tuning parameter selected by cv.glmnet function.
W-RF is implemented by our R functions, which can be found in Appendix A. To compare the ĝ(X) derived from the different methods, a large testing dataset of N = 100,000 observations, (X̃_i, Ã_i, Ỹ_i), is generated. E[Ỹ(ĝ(X))] is estimated from the testing dataset using the estimator in (3.1). Since we know the true data generating mechanism, g^opt(X̃_i) = 1(∆(X̃_i) > 0) is known for each individual in the testing dataset, and E[Ỹ(g^opt(X))] can thus be estimated using (3.1) as well. We can then calculate the bias of the population average outcome under the estimated ITR relative to the population average outcome under the true optimal ITR, that is, E[Ỹ(g^opt(X))] − E[Ỹ(ĝ(X))]; smaller bias indicates better performance in deriving the optimal ITR. Since g^opt(X̃_i) is known, we can also estimate the misclassification error rate (MCR) of the estimated ITR, Pr(g^opt(X) ≠ ĝ(X)), by the empirical proportion of disagreement between the optimal rule and the estimated ITR among the observations in the testing dataset. Since these estimates of the population average outcome and MCR under a rule are based on a large testing dataset, we treat them as fixed parameters of a treatment rule; that is, the 'hat' notation is not used for estimates based on the large dataset.
Note that in a real data analysis, g^opt(X_i) is not known, and thus MCR cannot be computed to evaluate the performance of an ITR. In contrast, E[Y(ĝ(X))] can still be estimated. In Section 3.3, we discussed K-fold cross-validation (CV) estimates and OOB estimates of E[Y(ĝ(X))]. In this simulation, we also compare Ê[Y(ĝ(X))] from the large testing dataset to the OOB, naive, and CV estimates to numerically assess the use of OOB estimates when deriving an ITR with W-RF.
In this simulation study, we also assess the variable importance measures (VIs) defined in Section 3.4. Recall that we discussed two types of VIs, each a numeric value for each biomarker. The ranking of the VIs provides the relative relevance of each biomarker to treatment selection. In our simulation setting, VI_1(X_1) and VI_1(X_2) should rank higher than all other VI_1(X_j). Among the VI_2(X_j), VI_2(X_1) and VI_2(X_2) should rank lowest, since a larger value of Y is desired. W-RF produces a set of VIs for every Monte Carlo dataset with sample size 100. In addition to side-by-side boxplots summarizing the VIs from the 500 Monte Carlo datasets, we use a summary statistic for the ranking of VIs. Denote U(VI) as follows: U(VI_1) = Pr[VI_1(X_j) > VI_1(X_j′)] and U(VI_2) = Pr[VI_2(X_j) < VI_2(X_j′)], for j ∈ {1, 2} and j′ ∈ {3, ..., p}. U(VI_k) can be estimated by the sample proportion. A U(VI_k) near 1 suggests that VI_k performs well in selecting informative biomarkers among multiple irrelevant baseline covariates. U(VI_k) can be thought of as the area under the receiver operating characteristic (ROC) curve, where X_1 and X_2 are 'cases' and the remaining covariates are 'controls', as in a diagnostic setting. A similar ROC curve analysis has been used to summarize the ranking of variable importance measures in a tree-based model in a previous study (Huang and Wang, 2012).
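The summary statistic U(VI) is the sample proportion of correctly ordered (signal, noise) pairs, exactly an empirical AUC; a minimal Python sketch:

```python
def u_statistic(vi_signal, vi_noise, larger_is_better=True):
    """Empirical U(VI): proportion of pairs (j in {1,2}, j' in {3,...,p})
    in which the informative biomarker's VI is ranked ahead of the noise
    covariate's VI (an AUC with X_1, X_2 as 'cases').  Set
    larger_is_better=False for VI_2, where lower values mean importance."""
    pairs = [(s, t) for s in vi_signal for t in vi_noise]
    if larger_is_better:
        correct = sum(s > t for s, t in pairs)
    else:
        correct = sum(s < t for s, t in pairs)
    return correct / len(pairs)
```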

Model Performance Comparisons
The results for the bias (i.e., E[Ỹ(g^opt(X))] − E[Ỹ(ĝ(X))]) and MCR in Scenarios I, II, and III are summarized in Tables 4.1, 4.2, and 4.3, respectively. Smaller values of the bias and MCR indicate better model performance. First, we notice that, for all three methods under the weighted classification framework, results using AIPWE for ∆(X_i) are uniformly better than results using IPWE for ∆(X_i). This observation that IPWE is less efficient and less stable agrees with previous results (Zhang et al., 2012a). Hereafter, results from methods under the weighted classification framework refer to results using AIPWE for ∆(X_i). In general, the performance of the models in subsetting B of each scenario is worse than in subsetting A, the subsetting with fewer noisy covariates.
In Scenario I (I-A and I-B), where the decision boundary is linear in X_1 and X_2, PLS has the best performance with respect to minimizing the bias and MCR. This is because the optimal ITR is within the model class of PLS, and PLS has a variable selection feature. W-logistic-l1 has the second best performance in Scenarios I-A and I-B for the same reasons. Direct-RF has the third and fourth best performance in Scenarios I-A and I-B. The comparison between PLS and Direct-RF in Scenario I suggests that the flexibility of Direct-RF comes at a cost in performance when the optimal decision boundary coincides with the decision boundary of PLS. In Scenario I, except for W-logistic-l1, the methods under the weighted classification framework have moderate performance, with MCR around 0.20.
In Scenario I-A, W-RF does not outperform the other methods. However, as we increase the number of irrelevant covariates in Scenario I-B, our proposed method, W-RF, suffers less from the 'curse of dimensionality' and outperforms HF-linear, HF-RBF, and Direct-RF.
A similar pattern is observed in Scenario III, where Direct-RF and W-RF have almost the same performance in the lower-dimensional subsetting (Scenario III-A), but W-RF outperforms Direct-RF in the higher-dimensional subsetting (Scenario III-B).
The decision boundary of the optimal ITR in Scenario II is a circle, which cannot be approximated well by a linear decision boundary (see Figure 4.1). Therefore, PLS has the worst performance (MCR about 0.65), while Direct-RF has reasonable performance due to its flexibility. For the same reason, HF-RBF performs better than HF-linear and W-logistic-l1. In this scenario, W-RF outperforms every other method by a large margin in both the low- and high-dimensional subsettings.
W-RF also has the best performance in Scenarios III-A and III-B. Although the decision boundary of the optimal ITR in Scenario III is not linear in X_1 and X_2, it can be approximated by a linear boundary to some extent (see Figure 4.1). Therefore, PLS has reasonable performance, which is not affected by the increasing dimensionality because of its variable selection feature.

OOB Estimation Of Population Average Outcome
The above results of model assessment are based on the large testing set (N = 10,000) and knowledge of the optimal treatment rule, which give the most objective and reliable assessment of the different methods. However, this assessment approach is only applicable in simulation studies. To investigate model assessment methods applicable to real data analysis, we further compare the estimate of E[Y(ĝ(X))] based on the large testing set, the naive estimate based on the original training set, and the OOB and 5-fold CV estimates of E[Y(ĝ(X))].
The results of the comparison are summarized in Table 4.4. The naive estimates are clearly overoptimistic due to overfitting bias. The averages of the 5-fold CV estimates are uniformly lower than the averages of the estimates from the large testing set and of the OOB estimates, because only 80% of the training data are used to derive the ITR. The VIs can identify the predictive biomarkers (X_1, X_2). We notice that VI_1 works extremely well in separating (X_1, X_2) from the rest of X, and we plot the logarithm of VI_1 for better visual display (Table 4).

Chapter 5

Data Example
In this chapter, we analyze data from the Clinical Antipsychotic Trials of Intervention Effectiveness Alzheimer's Disease Study (CATIE-AD) to illustrate our proposed method, W-RF. CATIE-AD is a double-blind randomized clinical study assessing the effectiveness of atypical antipsychotic drugs in patients with Alzheimer's disease (Schneider et al., 2006).
CATIE-AD has four phases. In Phase 1, subjects were randomly assigned to receive olanzapine, quetiapine, risperidone, or placebo in a 2:2:2:3 ratio. After the first two weeks, physicians could discontinue the treatment and move a patient to the next phase if the patient's response was not adequate. One of the primary outcomes in Phase 1 is the response rate at 12 weeks: patients who were still in Phase 1 and showed at least minimal improvement on the Clinical Global Impression of Change (CGIC) scale at 12 weeks were considered responders.
In this data analysis example, we apply W-RF to data from patients in the olanzapine and placebo arms to derive an ITR maximizing patients' response rate based on baseline covariates. We consider a wide range of baseline characteristics recorded in CATIE-AD, including demographic characteristics, such as gender, age, and marital status; clinical characteristics, such as the Mini-Mental State Examination total score; and many common laboratory tests. Covariates are excluded if the measurements were not taken at baseline for about 10% or more of the patients (e.g., the Alzheimer's Disease Assessment Scale total score is excluded from the analysis).
In total, 49 baseline covariates were recorded for most of the patients in the study. After excluding the 25 subjects with missing values in any of the 49 covariates (12 subjects in the olanzapine arm and 13 in the placebo arm), 87 subjects remain in the olanzapine arm and 126 in the placebo arm; in total, 213 subjects are included in the analysis. Among these 213 subjects, the response rate is 0.29 in the olanzapine arm and 0.21 in the placebo arm.
We consider all 49 covariates as potential predictive markers and apply our method, W-RF. We also assess each covariate by the first type of variable importance measure (VI_1).
Based on VI_1, the baseline chloride lab test (CHLORIDE), baseline Brief Psychiatric Rating Scale (BPRS), baseline lactate dehydrogenase lab test (LDH), and baseline Neuropsychiatric Inventory Score (NPIS) are shown to be more relevant for deriving the optimal ITR than the other covariates (see Figure 5.1). [Figure 5.1: variable importance (VI_1) for all 49 baseline covariates.] In addition, we construct the CATE curves (Han et al., 2015) to marginally assess these four covariates for treatment selection. β(X) is the covariate-specific coefficient of the treatment indicator in a varying-coefficient model with the logit link. As discussed in Section 2.2, we should recommend treating patients whose CATE curve is above the horizontal line β(X) = 0 and not treating patients whose CATE curve is below it.

[Figure 5.2: CATE curves; panels include CHLORIDE and BPRS.]
Based on the CATE curves (Figure 5.2), we might not recommend treating patients with serum chloride or lactate dehydrogenase (LDH) outside the normal range (chloride: 96 to 106 mEq/L; LDH: 105 to 333 IU/L). Also, patients with severe symptoms might be more likely to respond to the treatment, because the CATE curves are above the horizontal line β(X) = 0 when BPRS and NPIS scores are higher, which indicates more severe symptoms. However, we should note that all 95% simultaneous confidence bands of the CATE curves contain the horizontal line β(X) = 0, which suggests a 'statistically non-significant result', and we should be cautious about these patterns without further investigation and more evidence.

Chapter 6

Discussion
In this thesis, we review the current statistical research on individualized treatment rules.
We review causal interpretations of the optimal treatment rule based on ∆(X) and an intuitive weighted classification formulation. In our simulation studies, the comparison between W-RF and the direct random forests method (Direct-RF) shows the advantage of using the weighted classification framework.
Moreover, in our simulation study, our proposed variable importance measures, especially V I 1 , show extraordinary performance in ranking candidate biomarkers according to their relevance to optimizing the estimated average clinical outcome under the treatment rule. Our VIs provide a ranking without an explicit cut-off point to select the relevant baseline covariates. Further research is warranted to construct a systematic variable selection algorithm in W-RF.
It should be clear that the application of either our proposed method or many other methods discussed in this thesis to derive an ITR is an exploratory data analysis. Further confirmatory studies are needed for the estimated ITR before the ITR is used in practice.
In addition to deriving an ITR, one major goal of this exploratory data analysis is to motivate further clinical research on treatment heterogeneity. As in all other applications of random forests, the superior performance of random forests comes at the cost of interpretability. The ITR from 'black box' models, including W-RF and models with non-linear kernels, will not provide etiological knowledge or biological mechanisms for the treatment heterogeneity.
The lack of interpretability makes the variable importance measures in W-RF particularly useful for identifying potential predictive biomarkers and generating scientific hypotheses regarding their associations with treatment heterogeneity.

# formatting the final tree model in the matrix format
tree.matrix <- tree.matrix[, c("node", "feature", "cutoff",
                               "left.node", "right.node", "a", "terminal")]
tree.matrix <- tree.matrix[tree.matrix[, "terminal"] != -1, ]
tree.matrix[is.na(tree.matrix)] <- -1
if (class(tree.matrix) == "numeric") {
  tree.matrix <- t(as.matrix(tree.matrix))
}
return(list(tree.matrix, vi))
}
c.mytree <- cmpfun(mytree)