# Chapter 7 Appendix 2: Prediction and feature selection for survival endpoints

We turn our attention to survival analysis, which deals with so-called time-to-event endpoints. We start with a very brief introduction to survival analysis and introduce the Cox proportional hazards model. Then we turn to the high-dimensional setting and discuss elastic net regularization for Cox regression. Finally, we introduce the time-dependent Brier score, a popular prediction performance measure in the survival setting, and we benchmark forward selection and a survival random forest on a gene expression example.

## 7.1 Survival analysis and Cox regression

For subject $$i$$ we denote the event time by $$T_i$$. Typically we do not observe the event time for every subject, as a subject may be censored due to:

• Loss to follow-up
• Withdrawal from study
• No event by the end of the fixed study period.

Therefore, in practice we observe the survival time $$\widetilde{T}_i$$ (the event time or the censoring time, whichever occurs first) and the event indicator $$\delta_i$$ ($$\delta_i=1$$ in case of an event, $$\delta_i=0$$ in case of censoring).
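The observed data $$(\widetilde{T}_i,\delta_i)$$ can be built from the latent event and censoring times. A minimal sketch in R, with made-up times for three subjects:

```r
# Hypothetical latent event times T and censoring times C for 3 subjects
T.event  <- c(5, 3, 8)
C.censor <- c(4, 6, 2)

# Observed survival time: whichever occurs first
tt <- pmin(T.event, C.censor)              # c(4, 3, 2)

# Event indicator: 1 if the event occurred before censoring
delta <- as.numeric(T.event <= C.censor)   # c(0, 1, 0)
```

Only subject 2 contributes an observed event; subjects 1 and 3 are censored.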

A fundamental quantity in survival analysis is the survival probability

$S(t)=P(T>t)=1-F(t).$

Another important quantity is the hazard function $$h(t)$$, the instantaneous rate at which events occur, defined as

$h(t)=\lim_{dt\rightarrow 0}\frac{P(t\leq T < t+dt|T\geq t)}{dt}=-S'(t)/S(t).$
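As a sanity check of this definition: for an exponential distribution with rate $$\lambda$$ we have $$S(t)=\exp(-\lambda t)$$, and the hazard is constant, $$h(t)=\lambda$$. A small numerical sketch (values chosen arbitrarily):

```r
lambda <- 0.5
S <- function(t) exp(-lambda * t)

# approximate h(t) = -S'(t)/S(t) with a finite difference
t0 <- 2
dt <- 1e-6
h.approx <- -(S(t0 + dt) - S(t0)) / dt / S(t0)
h.approx   # close to lambda = 0.5
```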

How do we estimate these quantities when the event time is only partially known? In the presence of censoring, the Kaplan-Meier method is the most popular way to estimate survival probabilities, whereas the Cox proportional hazards model is commonly used to study the relationship between covariates and survival time. The Cox regression model assumes a semi-parametric form for the hazard

$h(t|X)=h_0(t)\exp(X^T\beta).$

Here $$h_0(t)$$ is the baseline hazard and $$\beta$$ is the vector of regression coefficients. The coefficients are estimated by maximizing the log partial likelihood $$\ell(\beta|{\bf y,X})$$. Note that $$\bf y$$ represents the observed response, i.e. the tuples $$(\widetilde{T}_i,\delta_i)$$, $$i=1,\ldots,n$$.
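To make this concrete before moving to the high-dimensional setting, here is a minimal Cox fit on the lung data shipped with the survival package (the covariates age and sex are chosen purely for illustration):

```r
library(survival)

# Cox proportional hazards fit on the built-in lung data
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

coef(fit)        # estimated log hazard ratios beta
exp(coef(fit))   # hazard ratios
```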

To learn more about survival analysis I recommend having a look at this short tutorial: https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html.

## 7.2 Regularized Cox regression

In a high-dimensional setting classical Cox regression falls short. As in the previous sections, we can use subset selection or regularization. The R package glmnet implements elastic net penalized Cox regression.
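Concretely, the elastic net estimator maximizes a penalized log partial likelihood (a sketch of the standard formulation; glmnet additionally rescales the likelihood term by a factor involving $$n$$):

$\hat{\beta}=\arg\max_{\beta}\;\ell(\beta|{\bf y,X})-\lambda\left(\alpha\|\beta\|_1+\frac{1-\alpha}{2}\|\beta\|_2^2\right).$

Here $$\alpha=1$$ corresponds to the lasso and $$\alpha=0$$ to ridge regression; below we use $$\alpha=0.95$$, which is close to the lasso but keeps a small ridge component to stabilize the fit among correlated genes.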

For illustration we use the Lymphoma data set which consists of gene expression data for $$p=7399$$ genes measured on $$n=240$$ patients, as well as censored survival times, for these patients. We start by reading the data.

```r
# read gene expression matrix (file name assumed; adapt to your local copy)
x <- as.matrix(read.table("lymphx.txt"))

# keep the 500 genes with largest variance and standardize
topvar.genes <- order(apply(x, 2, var), decreasing = TRUE)[1:500]
x <- scale(x[, topvar.genes])
```

```r
# read survival time and event variable (file name assumed)
y <- as.matrix(read.table("lymphtime.txt", header = TRUE))
```

The first column of y is the survival time and the second column the event indicator.

```r
head(y)
##      time status
## [1,]  5.0      0
## [2,]  5.9      0
## [3,]  6.6      0
## [4,] 13.1      0
## [5,]  1.6      1
## [6,]  1.3      1
```

Next we plot the estimated survival curve.

```r
library(survival)
library(survminer)

dat <- data.frame(cbind(y, x))
fit.surv <- survfit(Surv(time, status) ~ 1, data = dat)
ggsurvplot(fit.surv)
```

We can use the summary function to obtain more specific information on the estimated survival quantities; for example, the probability of surviving beyond 10 years is obtained as follows.

```r
summary(survfit(Surv(time, status) ~ 1, data = dat), times = 10)
```

We are interested in finding the genes which are most predictive of survival. We assume proportional hazards and use glmnet to perform elastic net penalized Cox regression with $$\alpha=0.95$$, i.e. close to the lasso. (Note that we only consider the $$p=500$$ genes with largest variance.)

```r
library(glmnet)

set.seed(1)
fit.coxnet <- glmnet(x, y, family = "cox", alpha = 0.95)
plot(fit.coxnet, xvar = "lambda")
```

To identify the optimal tuning parameter $$\lambda$$ we use cross-validation and take Harrell's concordance index (C-index) as a goodness-of-fit measure.

```r
cv.coxnet <- cv.glmnet(x, y,
                       family = "cox",
                       type.measure = "C",
                       alpha = 0.95)
plot(cv.coxnet)
```

The C-index ranges from 0.5 to 1. A value of 0.5 means that the model is no better at predicting the outcome than random chance. The tuning parameter which maximizes the C-index is $$\lambda_{\rm{opt}}=0.05$$. The next graphic shows the magnitude of the non-zero coefficients (note that we standardized the input covariates).
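To illustrate what the C-index measures, here is a naive implementation (a hypothetical helper for illustration, not part of glmnet): among comparable pairs, i.e. pairs where the subject with the shorter observed time actually had an event, it counts how often the model assigns that subject the higher risk. Ties in time are skipped for simplicity.

```r
cindex <- function(time, status, risk) {
  conc <- 0; comp <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    if (time[i] != time[j]) {
      k <- if (time[i] < time[j]) i else j   # subject with earlier time
      l <- if (time[i] < time[j]) j else i
      if (status[k] == 1) {                  # comparable only if k had an event
        comp <- comp + 1
        if (risk[k] > risk[l]) conc <- conc + 1
        else if (risk[k] == risk[l]) conc <- conc + 0.5
      }
    }
  }
  conc / comp
}

# a model whose risk ordering matches the event ordering scores C = 1
cindex(time = c(1, 2, 3, 4), status = c(1, 1, 0, 1), risk = c(4, 3, 2, 1))
```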

## 7.3 Brier score and predictions in survival analysis

Evaluating predictive performance in the linear regression and classification contexts was straightforward. For time-to-event data this is more involved. A popular score for measuring prediction accuracy is the time-dependent Brier score

${\rm BS}(t,\hat{S})={\bf E}\left(Y_{\rm{new}}(t)-\hat{S}(t|X_{\rm new})\right)^2$

where $$Y_{\rm{new}}(t)={\bf 1}(T_{\rm new}\geq t)$$ is the true status of a new test subject and $$\hat{S}(t|X_{\rm new})$$ is the predicted survival probability. Calculation of the Brier score is complicated by the fact that we do not always observe the event time $$T$$ due to censoring. The R package pec estimates the Brier score using a technique called Inverse Probability of Censoring Weighting (IPCW).
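Without censoring, the Brier score at time $$t$$ reduces to a mean squared difference between the true status and the predicted survival probability. The following toy sketch (made-up event times and predictions) shows the computation; IPCW then reweights such squared differences by inverse estimates of the censoring survival probabilities.

```r
t0      <- 5                       # evaluation time
T.event <- c(2, 7, 9, 3)           # hypothetical, fully observed event times
S.hat   <- c(0.3, 0.8, 0.9, 0.4)   # hypothetical predicted S(t0 | X)

Y  <- as.numeric(T.event > t0)     # true status at t0: c(0, 1, 1, 0)
bs <- mean((Y - S.hat)^2)
bs                                 # 0.075
```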

We split the data set into training and test data.

```r
train_ind <- sample(1:nrow(x), size = nrow(x)/2)
xtrain <- x[train_ind, ]
ytrain <- y[train_ind, ]
xtest <- x[-train_ind, ]
ytest <- y[-train_ind, ]
dtrain <- data.frame(cbind(ytrain, xtrain))
dtest <- data.frame(cbind(ytest, xtest))
```

For illustration we use forward selection on the training data to obtain a prediction model.

```r
library(MASS)   # stepAIC

fit.lo <- coxph(Surv(time, status) ~ 1, data = dtrain,
                x = TRUE, y = TRUE)
up <- as.formula(paste("~",
                       paste(colnames(xtrain),
                             collapse = "+")))
fit.fw <- stepAIC(fit.lo,
                  scope = list(lower = fit.lo,
                               upper = up),
                  direction = "forward",
                  steps = 5,
                  trace = FALSE)
```

The following table summarizes the variables added in each step of the forward selection approach.

```r
kable(as.data.frame(fit.fw$anova), digits = 3)
```

| Step    | Df | Deviance | Resid. Df | Resid. Dev | AIC     |
|---------|----|----------|-----------|------------|---------|
|         | NA | NA       | 70        | 601.608    | 601.608 |
| + V3799 | 1  | 9.381    | 69        | 592.227    | 594.227 |
| + V5296 | 1  | 8.879    | 68        | 583.348    | 587.348 |
| + V5034 | 1  | 12.865   | 67        | 570.483    | 576.483 |
| + V5063 | 1  | 10.764   | 66        | 559.719    | 567.719 |
| + V4498 | 1  | 8.271    | 65        | 551.448    | 561.448 |

The next figure shows the estimated hazard ratios and confidence intervals. This type of plot is often referred to as a forest plot.

```r
survminer::ggforest(fit.fw)
```

Finally we use the pec package to calculate the Brier score on the training and test data.

```r
library(pec)

fit.pec.train <- pec::pec(
  object = list("cox.fw" = fit.fw),
  data = dtrain,
  formula = Surv(time, status) ~ 1,
  splitMethod = "none")
fit.pec.test <- pec::pec(
  object = list("cox.fw" = fit.fw),
  data = dtest,
  formula = Surv(time, status) ~ 1,
  splitMethod = "none")
```

The following figure shows the Brier scores evaluated on training and test data.

```r
par(mfrow = c(1, 2))
plot(fit.pec.train, main = "training data")
plot(fit.pec.test, main = "test data")
```

The plot on the right shows the Brier score on the test data and indicates that the selected model performs no better than the reference model (no covariates). The pec package can also be used to benchmark different prediction models. We illustrate this based on random forest and forward selection. In this illustration we do not split the data into training and test sets. Instead we use cross-validation to compare the two prediction approaches. We start by writing a small wrapper function so that forward selection can be used within pec.

```r
selectCoxfw <- function(formula, data, steps = 5, direction = "both") {
  require(prodlim)
  fmlo <- reformulate("1", formula[[2]])
  fitlo <- coxph(fmlo, data = data, x = TRUE, y = TRUE)
  fwfit <- stepAIC(fitlo,
                   scope = list(lower = fitlo,
                                upper = formula),
                   direction = direction,
                   steps = steps,
                   trace = FALSE)
  if (fwfit$formula[[3]] == 1) {
    # no covariate selected: fall back to the Kaplan-Meier estimator
    newform <- reformulate("1", formula[[2]])
    newfit <- prodlim(newform,
                      data = data)
  } else {
    newform <- fwfit$formula
    newfit <- coxph(newform, data = data, x = TRUE, y = TRUE)
  }
  out <- list(fit = newfit,
              In = attr(terms(newfit$formula), which = "term.labels"))
  out$call <- match.call()
  class(out) <- "selectCoxfw"

  out
}
```

```r
predictSurvProb.selectCoxfw <- function(object, newdata, times, ...) {
  predictSurvProb(object[[1]], newdata = newdata, times = times, ...)
}
```

We run forward selection with a maximum of 10 steps, using the 50 genes with the largest variance.

```r
dat.top50 <- dat[, 1:52]
fm <- as.formula(paste("Surv(time, status) ~ ",
                       paste(colnames(dat.top50[, -(1:2)]),
                             collapse = "+")))
fit.coxfw <- selectCoxfw(fm, data = dat.top50,
                         direction = "forward", steps = 10)
```

We fit a random forest using cforest from the party package.

```r
fit.cforest <- pec::pecCforest(fm, data = dat.top50,
                               control = party::cforest_classical(ntree = 100))
```

We can obtain a measure of variable importance using the function varimp.

```r
## 
## Variable importance for survival forests; this feature is _experimental_
```

Finally we compare the two approaches using the cross-validated Brier score.

```r
pec.cv <- pec::pec(
  object = list("cox.fw" = fit.coxfw, "cforest" = fit.cforest),
  data = dat.top50,
  formula = Surv(time, status) ~ 1,
  splitMethod = "cv10")
plot(pec.cv)
```

We conclude that both forward selection and the random forest perform similarly to the reference model (no covariates included).
