# Chapter 20 Survival Analysis

## 20.1 What is survival data?

Time-to-event data that consists of a distinct start time and end time. These data are common in the medical field.

Examples from cancer:

• Time from surgery to death
• Time from start of treatment to progression
• Time from response to recurrence

### 20.1.1 Examples from other fields

Time-to-event data is common in many fields including, but not limited to:

• Crime: Time from release to criminal recidivism
• Marketing: Customer retention
• Engineering: Time to machine malfunction
• Finance: Time to loan default
• Management: Time to employee resignation/termination

### 20.1.2 Aliases for survival analysis

Because survival analysis is common in many other fields, it also goes by other names:

• Reliability analysis
• Duration analysis
• Event history analysis
• Time-to-event analysis

## 20.2 Questions of interest

The two most common questions I encounter related to survival analysis are:

1. What is the probability of survival to a certain point in time?
2. What is the average survival time?

## 20.3 What is censoring?

In survival analysis, it is common for the exact event time to be unknown, or unobserved, which is called censoring. A subject may be censored due to:

• Loss to follow-up
• Withdrawal from study
• No event by end of fixed study period

Specifically these are examples of right censoring. Other common types of censoring include:

• Left (you do not observed the beginning)
• Interval (individual is loss during the study)

### 20.3.1 Censored survival data

When the exact event time is unknown then some patients are censored, and survival analysis methods are required.

We can incorporate censored data using survival analysis techniques

Toy example of a Kaplan-Meier curve for this simple data (details to follow):

• Horizontal lines represent survival duration for the interval
• An interval is terminated by an event
• The height of vertical lines show the change in cumulative probability
• Censored observations, indicated by tick marks, reduces the cumulative survival between intervals

## 20.4 Danger of ignoring censoring

Case study: musicians and mortality

Conclusion: Musical genre is associated with early death among musicians.

Problem: this graph does not account for the right-censored nature of the data.

## 20.5 Components of survival data

For subject $$i$$:

1. Event time $$T_i$$

2. Censoring time $$C_i$$

3. Event indicator $$\delta_i$$:

• 1 if event observed (i.e. $$T_i \leq C_i$$)
• 0 if censored (i.e. $$T_i > C_i$$)
4. Observed time $$Y_i = \min(T_i, C_i)$$

## 20.6 Data example

### 20.6.1 Research question of interest

Organ Transplantation is limited by the supply of cadaveric and living donors.

Unfortunately, many people die while waiting on the list.

What types of characteristics lead to an increase or decrease likelihood of dying on the waiting list?

### 20.6.2 Data structure

data("transplant")
• 815 of people on the waiting list
• Outcome: Death
• Predictors: age, sex, Blood type, year added to list
transplant %>%
##   age sex abo year futime    event
## 1  47   m   B 1994   1197    death
## 2  55   m   A 1991     28      ltx
## 3  52   m   B 1996     85      ltx
## 4  40   f   O 1995    231      ltx
## 5  70   m   O 1996   1271 censored
## 6  66   f   O 1996     58      ltx

### 20.6.3 Variables:

• “futime”: the observed time $$Y_i = min(T_i, C_i)$$
• “event”: the event factor variable can take on censored, death, liver transpant, or withdraw
• “abo”: blood type A, B, AB, or O
• sex: Male or Female
• age: age at addition to the waiting list

### 20.6.4 Preparing data for analysis

Event indicator

Most functions used in survival analysis will also require a binary indicator of event that is:

• 0 for no event
• 1 for event

Currently our data example contains a factor variable indicating whehter the patient has died or if they have received a liver transpant, been withdrawn, or are censored.

transplant$delta<-0 transplant$delta[transplant$event=="death"]<-1 ## 20.7 Analyzing survival data ### 20.7.1 Questions of interest Recall the questions of interest: 1. What is the probability of surviving to a certain point in time? 2. What is the average survival time? ### 20.7.2 Creating survival objects The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs. • The Surv function from the survival package creates a survival object for use as the response in a model formula. • There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored. Let’s look at the first 10 observations: survival::Surv(transplant$futime, transplant$delta)[1:10] ## [1] 1197 28+ 85+ 231+ 1271+ 58+ 392+ 30+ 12+ 139+ ### 20.7.3 Estimating survival curves with the Kaplan-Meier method • The survival::survfit function creates survival curves based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object f1, and look at the names of that object: f1 <- survfit(Surv(transplant$futime, transplant$delta) ~ 1, data = transplant) names(f1) ## [1] "n" "time" "n.risk" "n.event" "n.censor" "surv" ## [7] "std.err" "cumhaz" "std.chaz" "type" "logse" "conf.int" ## [13] "conf.type" "lower" "upper" "call" Some key components of this survfit object that will be used to create survival curves include: • time, which contains the start and endpoints of each time interval • surv, which contains the survival probability corresponding to each time ### 20.7.4 Kaplan-Meier plot - base R Now we plot the survfit object in base R to get the Kaplan-Meier plot: plot(f1, xlab = "Days", ylab = "Overall survival probability") • The default plot in base R shows the step function (solid line) with associated confidence intervals (dotted lines). Note that the tick marks for censored patients are not shown by default, but could be added using mark.time = TRUE ### 20.7.5 Estimating survival time One quantity often of interest in a survival analysis is the probability of surviving a certain number $$(x)$$ of years. For example, to estimate the probability of surviving to 5 years, use summary with the times argument: summary(f1, times = 5*365) ## Call: survfit(formula = Surv(transplant$futime, transplant$delta) ~ ## 1, data = transplant) ## ## time n.risk n.event survival std.err lower 95% CI upper 95% CI ## 1825 2 66 0.49 0.206 0.215 1 We find that the 5-year probability of survival in this study is 49%. The associated lower and upper bounds of the 95% confidence interval are also displayed. ### 20.7.6 Survival time is often estimated incorrectly What happens if you use a “naive” estimate? 66 of the 815 patients died by 5 years so: $\Big(1 - \frac{66}{815}\Big) \times 100 = 92\%$ • You get an incorrect estimate of the 5-year probability of survival when you ignore the fact that 747 patients were censored before 5 years. • Recall the correct estimate of the 5-year probability of survival was 49%. ### 20.7.7 Estimating median survival time Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median (survival times are not expected to be normally distributed so the mean is not an appropriate summary). We can obtain this directly from our survfit object: survival::survfit(survival::Surv(futime, delta) ~ 1, data = transplant) ## Call: survfit(formula = survival::Surv(futime, delta) ~ 1, data = transplant) ## ## n events median 0.95LCL 0.95UCL ## [1,] 815 66 1422 1422 NA round(summary(f1)$table["median"]/365, 1)

We see the median survival time is 3.9 years. The lower and upper bounds of the 95% confidence interval are also displayed.

### 20.7.8 Median survival is often estimated incorrectly

What happens if you use a “naive” estimate?

Summarize the median survival time among the 66 patients who died:

transplant$futime[transplant$delta == 1] %>%
median
## [1] 65.5
• You get an incorrect estimate of median survival time of 0.2 years when you ignore the fact that censored patients also contribute follow-up time.

• Recall the correct estimate of median survival time is 3.9 years.

## 20.8 Comparing survival times between groups

### 20.8.1 Questions of interest with respect to between-group differences

Is there a difference in survival probability between groups?

From our example: does the probability of survival differ according to gender among liver transplant patients?

### 20.8.2 Kaplan-Meier plot by group

We can add a covariate to the right-hand side of the survival::survfit object to obtain a stratified Kaplan-Meier plot.

Let’s also look at some other customization we can do with survminer::ggsurvplot.

survminer::ggsurvplot(
fit = survival::survfit(survival::Surv(futime, delta) ~ abo, data = transplant),
xlab = "Days",
ylab = "Overall survival probability",
legend.title = "Blood Type",
legend.labs = c("A", "B", "AB","O"),
break.x.by = 100,
censor = FALSE,
risk.table = TRUE,
risk.table.y.text = FALSE)

• Blood type B patients have the lowest overall survival probability.
• The risk table below the plot shows us the number of patients at risk at certain time points, which can give an idea of how much information is being used to calculate the estimates at each time
• Male patients have the lowest overall survival probability.

## 20.9$$x$$-year survival probability by group

As before, we can get an estimate of, for example, 1-year survival by using summary with the times argument in our survival::survfit object:

summary(survfit(Surv(futime, delta) ~ sex, data = transplant), times = 365)
## Call: survfit(formula = Surv(futime, delta) ~ sex, data = transplant)
##
##                 sex=m
##         time       n.risk      n.event     survival      std.err lower 95% CI
##     365.0000      82.0000      37.0000       0.8641       0.0244       0.8175
## upper 95% CI
##       0.9134
##
##                 sex=f
##         time       n.risk      n.event     survival      std.err lower 95% CI
##     365.0000      59.0000      22.0000       0.8763       0.0276       0.8238
## upper 95% CI
##       0.9322

## 20.10 Log-rank test for between-group significance test

• We can conduct between-group significance tests using a log-rank test.
• The log-rank test equally weights observations over the entire follow-up time and is the most common way to compare survival times between groups.
• There are versions that more heavily weight the early or late follow-up that could be more appropriate depending on the research question.

We get the log-rank p-value using the survival::survdiff function:

survival::survdiff(Surv(futime, delta) ~ sex, data = transplant)
## Call:
## survival::survdiff(formula = Surv(futime, delta) ~ sex, data = transplant)
##
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=m 447       42     36.4     0.861      1.93
## sex=f 368       24     29.6     1.059      1.93
##
##  Chisq= 1.9  on 1 degrees of freedom, p= 0.2

And we see that the p-value is .2, indicating no significant difference in overall survival according to gender.

## 20.11 The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables.

The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.

Some key assumptions of the model:

• non-informative censoring
• proportional hazards

### 20.11.1 Cox regression example using a single covariate

We can fit regression models for survival data using the survival::coxph function, which takes a survival::Surv object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

survival::coxph(survival::Surv(futime, delta) ~ factor(abo), data = transplant)

We can see a tidy version of the output using the tidy function from the broom package:

broom::tidy(survival::coxph(survival::Surv(futime, delta) ~ factor(abo),
data = transplant))
## # A tibble: 3 × 5
##   term          estimate std.error statistic p.value
##   <chr>            <dbl>     <dbl>     <dbl>   <dbl>
## 1 factor(abo)B    0.183      0.386    0.474    0.636
## 2 factor(abo)AB   0.195      0.618    0.316    0.752
## 3 factor(abo)O   -0.0110     0.284   -0.0388   0.969

### 20.11.2 Hazard ratios

The quantity of interest from a Cox regression model is a hazard ratio (HR).

If you have a regression parameter $$\beta$$ (from column estimate in our survival::coxph) then HR = $$\exp(\beta)$$.

For example, from our example we obtain the regression parameter $$\beta_1=0.195$$ for AB vs B blood type, so we have HR = $$\exp(\beta_1)=1.21$$.

A HR < 1 indicates reduced hazard of death whereas a HR > 1 indicates an increased hazard of death.

So we would say that a person with AB blood type has 1.21 times increased hazard of death as compared to a person with A blood type.

## 20.12 Summary

• Time-to-event data is common
• Survival analysis techniques are required to account for censored data
• The survival package provides tools for survival analysis, including the Surv and survfit functions
• The survminer package allows for customization of Kaplan-Meier plots based on ggplot2
• Between-group comparisons can be made with the log-rank test using survival::survdiff
• Multiavariable Cox regression analysis can be accomplished using survival::coxph

## 20.13 Empoyee Attrition Example

# First let's import the data
q = read.csv('turnover2.csv', header = TRUE, sep = ",", na.strings = c("",NA))

Our case uses data of 1785 employees.

### 20.13.1 Variables:

$$exp$$ - length of employment in the company

$$event$$ - event (1 - terminated, 0 - currently employed)

$$branch$$ - branch

$$pipeline$$ - source of recruitment

Please note that the data is already prepared for survival analysis. Moreover, length of employment is counted in months up to two decimal places, according to the following formula: (date fire - date hire) / (365.25 / 12).

### 20.13.2 Cox Model

 Dependent variable: exp branchfirst -0.089 (0.164) branchfourth -0.529c (0.176) branchsecond -0.424c (0.164) branchthird -0.466b (0.210) pipelineea -0.214 (0.267) pipelinejs 0.503b (0.201) pipelineref 0.322 (0.214) pipelinesm 0.361 (0.258) Observations 1,785 R2 0.037 Max. Possible R2 1.000 Log Likelihood -6,767.031 Wald Test 63.260c (df = 8) LR Test 67.275c (df = 8) Score (Logrank) Test 64.850c (df = 8) Note: p<0.1; p<0.05; p<0.01

## 20.14 Accelerated failure-time models

### 20.14.1 survreg(formula, dist=‘weibull’)

An accelerated failure-time (AFT) model is a parametric model with covariates and failure times following the survival function of the form $$S(x|Z) = S_0 (x exp[\theta Z])$$, where $$S_0$$ is a function for the baseline survival rate. The term $$exp[\theta Z]$$ is called the acceleration factor.

The AFT model uses covariates to place individuals on different time scales { note the scaling by the covariates in $$S(t|Z)$$ via $$exp[\theta Z]$$. The AFT model can be rewritten in a log-linear form, where the log of failure time (call this logX) is linearly related to the mean $$\mu$$, the acceleration factor, and an error term $$\sigma W$$: $log X = \theta'Z+\sigma W$

### 20.14.2 Different AFT Models

There are several AFT models depending on the assumption about the distribution of the error $$W$$.

Distribution df Included in Survival
Exponential 1 yes
Weibull 2 yes
lognormal 2 yes
log logistic 2 yes
generalized gamma 3 no

### 20.14.3 Weibull

The Weibull distribution has a survival function equal to $S(t)=exp(-(\lambda t)^\rho)$

and a hazard function equal to $\Lambda(t)=\rho \lambda(\lambda t)^{\rho-1}$

where $$\lambda>0$$ and $$\rho>0$$. A special case of the Weibul functions is the exponential distribution when $$\rho=1$$.

If $$\rho > 1$$, then the risk increases over time.

If $$\rho < 1$$, then the risk decreases over time.

### 20.14.4 Weibul Model in R

The survial package allows us to estimate all of these models. We will use the same model with a small change to estimate a Weibul AFT model.

w2 = survreg(Surv(exp, event) ~ . , data = q, dist = "weibul")
w3 = survreg(Surv(exp, event) ~ . , data = q, dist = "exponential")
stargazer::stargazer(w2,w3,type="text",style = "qje",single.row = TRUE, star.char = c("a","b","c"))

### 20.14.5 Weibul Model in R

 exp Weibull exponential (1) (2) branchfirst 0.142 (0.147) 0.119 (0.164) branchfourth 0.494*** (0.157) 0.524*** (0.176) branchsecond 0.433*** (0.146) 0.441*** (0.163) branchthird 0.434** (0.189) 0.469** (0.210) pipelineea 0.228 (0.239) 0.239 (0.267) pipelinejs -0.529*** (0.180) -0.537*** (0.200) pipelineref -0.383** (0.192) -0.372* (0.213) pipelinesm -0.376 (0.232) -0.386 (0.258) Constant 1.831*** (0.220) 1.860*** (0.245) N 1,785 1,785 Log Likelihood -2,712.426 -2,721.241 chi2 (df = 8) 74.713*** 69.134*** Notes: ***Significant at the 1 percent level. **Significant at the 5 percent level. *Significant at the 10 percent level.

### 20.14.6 Do we need accelerated time?

The exponential model assumes a constant effect of time.

We can perform a likelihood ratio test between the exponential and Weibul model to see if AFT is neccessary.

library(lmtest)
lrtest(w2,w3)
## Likelihood ratio test
##
## Model 1: Surv(exp, event) ~ branch + pipeline
## Model 2: Surv(exp, event) ~ branch + pipeline
##   #Df  LogLik Df Chisq Pr(>Chisq)
## 1  10 -2712.4
## 2   9 -2721.2 -1 17.63  2.683e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Although there is little difference in the parameter estimates, the likelihood ratio test suggest we should use the Weibul model over the exponential.

### 20.14.7 This is only the beginning

• Machine learning methods are improving on survival models
• We have only discussed X’s that are constant over time, but what about time-varying covariates (or worse yet time-dependent covariates)
• There are also issues of competing risks