In the paper "Ensemble Methods for Survival Function Estimation with Time-Varying Covariates", we generalize the conditional inference and relative risk forests to allow time-varying covariates and propose two forest algorithms, CIF-TV and RRF-TV. By design, the proposed methods can handle survival data with any combination of left-truncation and right-censoring in the survival outcome, and with both time-invariant and time-varying covariates. We name the methods LTRC CIF and LTRC RRF when referring to the conditional inference forest and relative risk forest, respectively.
The pkg folder contains the R package LTRCforests, available on CRAN.
We provide here the analysis code for the paper, as well as the analysis code for time-invariant covariate data, in the analysis folder. Supplemental Material for the paper can be found in the subfolder doc.
The analysis folder provides the analysis code used in the paper:
- The subfolder data contains
  - functions to create simulated datasets with time-varying covariates
  - functions to create simulated LTRC datasets with time-invariant covariates
  - a function to obtain the real dataset.
- The subfolder codes contains the scripts to reproduce the analysis:
  - simulations_tvary.R -- code to reproduce results for simulated datasets with time-varying covariates
  - simulations_tindepLTRC.R -- code to reproduce results for simulated LTRC datasets with time-invariant covariates
  - plot_and_tables_tvary.R -- code to reproduce plots and tables for simulated datasets with time-varying covariates
  - plot_and_tables_tindepLTRC.R -- code to reproduce plots and tables for simulated LTRC datasets with time-invariant covariates
  - realsetPBC.R -- analysis of the real dataset (including functions to reproduce plots)

  In particular, simulations_tvary.R and simulations_tindepLTRC.R provide results
  - to compare the performance of LTRC forests under default and proposed parameter settings
  - to compare the performance of the four methods: the Cox model, CIF, RRF and TSF (all forests trained with proposed parameter settings)
  - to choose among methods using IBS-based 10-fold CV, and to compare the results produced by this selection rule with those of the best method.
- The subfolder utils contains the source functions used to perform the analysis in the folder codes, including the functions to compute the integrated L2 difference.
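As a rough illustration of what an integrated L2 difference measures, here is a minimal sketch in Python (not the R code in utils). It assumes the metric is the squared difference between an estimated and a true survival curve, integrated over [0, tau] and divided by tau, then averaged over subjects; the exact definition used in the paper and in utils may differ. The function names, `tau`, and the toy curves are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

SurvFn = Callable[[float], float]  # maps a time t to a survival probability


def integrated_l2(est: SurvFn, true: SurvFn, tau: float, n_grid: int = 1000) -> float:
    """Approximate (1/tau) * integral over [0, tau] of (est(t) - true(t))^2
    with the trapezoidal rule on an evenly spaced grid."""
    step = tau / n_grid
    prev = (est(0.0) - true(0.0)) ** 2
    total = 0.0
    for k in range(1, n_grid + 1):
        cur = (est(k * step) - true(k * step)) ** 2
        total += 0.5 * (prev + cur) * step
        prev = cur
    return total / tau


def mean_integrated_l2(curves: Sequence[Tuple[SurvFn, SurvFn, float]]) -> float:
    """Average the per-subject integrated L2 difference over all subjects.
    Each entry is (estimated curve, true curve, subject-specific tau)."""
    return sum(integrated_l2(e, t, tau) for e, t, tau in curves) / len(curves)


# Toy curves (assumed, for illustration only): true exponential survival
# and an estimate with a slightly different rate.
def true_surv(t: float) -> float:
    return math.exp(-t)


def est_surv(t: float) -> float:
    return math.exp(-1.2 * t)


print(mean_integrated_l2([(est_surv, true_surv, 2.0), (true_surv, true_surv, 3.0)]))
```

Smaller values mean the estimated survival curves track the true ones more closely, which is how the boxplots in this document are meant to be read.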
The main analysis applying the methodology to time-varying covariate data is provided in the paper "Ensemble Methods for Survival Data with Time-Varying Covariates". Here we provide some more detailed information as supplemental material.
The value of mtry can be fine-tuned using the "out-of-bag" observations. The simulation results show that tuning can greatly improve forest performance over the default setting. See the following figure for performance comparisons using CIF-TV for different values of mtry vs. the optimal one (Opt) vs. the one chosen by the tuning algorithm (Tuned).
See the corresponding figures for similar results using RRF-TV and TSF-TV.
Throughout the experiments, we use ntree=100L for all forest ensembles. It has been recommended that a random forest should have between 64 and 128 trees (see Lecture Notes in Computer Science). Generally, more trees result in better accuracy. However, more trees also mean higher computational cost, and beyond a certain number of trees the improvement is negligible. See the following figure for performance comparisons for different numbers of trees built in the forest methods.
In forest-like algorithms, bootstrapped samples are typically used to construct each individual tree to increase independence between these base learners. For time-varying covariate data, we have considered two different ways to bootstrap the observations:
- Bootstrapping pseudo-subjects. The "independent" observations are bootstrapped as the first step of the forest algorithm, since all pseudo-subjects are treated as independent observations in the recursive partitioning process;
- Bootstrapping subjects. Subjects are bootstrapped, and all of the pseudo-subjects for each selected subject are kept in the bootstrap sample.
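The difference between the two schemes can be sketched as follows. This is a minimal illustration in Python, not the implementation in the LTRCforests package; the row layout `(subject_id, start, stop, event)` and the toy data are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import List, Tuple

# One row per pseudo-subject: (subject_id, start, stop, event).
# Toy data, assumed for illustration only.
Row = Tuple[int, float, float, int]
ROWS: List[Row] = [
    (1, 0.0, 2.0, 0), (1, 2.0, 5.0, 1),
    (2, 0.0, 3.0, 0),
    (3, 0.0, 1.0, 0), (3, 1.0, 4.0, 0), (3, 4.0, 6.0, 1),
]


def bootstrap_pseudo_subjects(rows: List[Row], rng: random.Random) -> List[Row]:
    """Resample rows with replacement, treating every pseudo-subject
    as an independent observation."""
    return [rng.choice(rows) for _ in rows]


def bootstrap_subjects(rows: List[Row], rng: random.Random) -> List[Row]:
    """Resample subject ids with replacement and keep all pseudo-subjects
    of every selected subject."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row[0]].append(row)
    ids = list(by_subject)
    sample: List[Row] = []
    for _ in ids:
        sample.extend(by_subject[rng.choice(ids)])
    return sample


rng = random.Random(2024)
print(bootstrap_pseudo_subjects(ROWS, rng))
print(bootstrap_subjects(ROWS, rng))
```

Under the first scheme the sample always has as many rows as the original data, but a subject's history may be only partly represented. Under the second, each drawn subject contributes its complete history, so the number of rows varies from one bootstrap sample to the next.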
Simulations have shown that the two bootstrapping mechanisms do not result in fundamentally different levels of performance.
"Ensemble Methods for Survival Data with Time-Varying Covariates" mainly focuses on the analysis of the methodology applied on time-varying covariate data. There are certainly many situations in which only time-invariant (baseline) covariate information is available, and understanding the properties of different methods in that situation is important. In fact, our developed methodology and algorithms allow for estimation using the proposed forests for (left-truncated) right-censored data with time-invariant covariates.
The same data-driven guidance for tuning the parameters or selecting a modeling method also applies to the time-invariant covariate case (for both left-truncated right-censored survival data and right-censored survival data).
The following figures show how LTRC CIF performs with different values of mtry under the PH setting and the non-PH setting, respectively. The datasets are generated with survival times following a Weibull-Increasing distribution and a light (right-)censoring rate. The results suggest that the tuning guidance is broadly effective, regardless of additional left-truncation and regardless of the presence of time-varying effects.
Figure 2.1. Integrated L2 difference of LTRC CIF with different mtry values under the PH setting.
Figure 2.2. Integrated L2 difference of LTRC CIF with different mtry values under the non-PH setting.
See below the boxplots of integrated L2 difference for performance comparison. Datasets are generated with time-invariant covariates and left-truncated right-censored survival times following a Weibull-Increasing distribution. The first row shows results for the number of subjects N=100, the second row for N=300, the third row for N=500, and the bottom row for N=1000; the first column shows results for a linear survival relationship, the second column for nonlinear, and the third column for interaction. In each plot: LTRC CIF(P) -- LTRC CIF with proposed parameter settings; LTRC RRF(P) -- LTRC RRF with proposed parameter settings; LTRC TSF(P) -- LTRC TSF with proposed parameter settings; Opt -- best method; IBSCV -- method chosen by IBS-based 10-fold CV.
Figure 3.3. Boxplots of integrated L2 difference for performance comparison on time-invariant covariate data.
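The IBS-based 10-fold CV selection rule (IBSCV) can be outlined as below. This is a hedged sketch in Python, not the code used for the paper: it assumes folds are formed by subject, so that all rows of a subject stay in the same fold, and that the method with the lowest mean IBS across folds is chosen. The function names are hypothetical and the scores are placeholders, not results.

```python
import random
from statistics import mean
from typing import Dict, Iterable, List


def assign_folds(subject_ids: Iterable[int], k: int = 10, seed: int = 0) -> Dict[int, int]:
    """Assign each distinct subject to one of k folds of near-equal size.
    Splitting by subject is an assumption of this sketch."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    return {sid: i % k for i, sid in enumerate(ids)}


def select_method(fold_scores: Dict[str, List[float]]) -> str:
    """Return the method whose mean score (e.g. IBS) across folds is lowest."""
    return min(fold_scores, key=lambda m: mean(fold_scores[m]))


folds = assign_folds(range(25), k=10)

# Placeholder per-fold IBS values, made up purely to show the call shape.
placeholder_scores = {
    "Cox": [0.21, 0.19, 0.22],
    "LTRC CIF(P)": [0.18, 0.17, 0.19],
    "LTRC RRF(P)": [0.18, 0.20, 0.19],
}
print(select_method(placeholder_scores))
```

In the boxplots, the method picked this way on each dataset is what is reported as IBSCV, and it is compared with Opt, the method that actually performed best.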