Chapter 1 Concepts

library(tidyverse)
library(scales)
library(survival)
library(survminer)

This chapter introduces survival analysis concepts.

Let $T^*$ be a random variable representing the time until the event, and $U$ be a random variable representing the time until (right) censoring. The observed value is whichever event comes first, $T = \mathrm{min}(T^*, U)$. The time, $T$, and status, $\delta = I[T^* < U]$ (1 = censored, 2 = event), constitutes the response in a survival analysis. The survival::lung data set is typical, where time = $T$ and status = $\delta$, and the other variables are identifying features for the subject.

survival::lung %>% head()

##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

Censoring sometimes occurs when subjects are monitored for a fixed period of time (Type I), the study is halted after a pre-specified level of events are reached (Type II), or the subject drops out for a reason other than the event of interest (random censoring).

You can specify the survival distribution either with a survival function, $S(t)$, or with a hazard function, $h(t)$². Let $F(t) = P(T \le t)$ be the cumulative risk function (aka, cumulative incidence), the probability of the event occurring on or before time $t$. $S(t)$ is its complement, $S(t) = 1 - F(t)$.

\[S(t) = P(T > t).\]

The hazard function is the instantaneous event rate at $t$ given survival up to $t$,

\[h(t) = \lim_{\delta \rightarrow 0}{\frac{P(t < T < t + \delta|T > t)}{\delta}}.\]

An instantaneous event rate has no intuitive appeal, but think of it in discrete time where $\delta > 0$. $h(t + \delta)$ is the conditional probability of an event at the discrete interval $t + \delta$, conditioned on those at risk during that interval.

The survival and hazard functions are related by the multiplication rule, $P(AB) = P(A|B)P(B)$. The event probability at $t$, $f(t) = F'(t)$, is the probability of the event at $t$ given survival up to $t$ (the hazard function) multiplied by the probability of survival up to $t$ (the survival function).

\[f(t) = h(t) S(t).\]

Again in discrete terms, survival up to interval $t$ is the sum product of the survival probabilities at each preceding period, $S(t) = \Pi_{i = 1}^t [1 - h(t)]$. It is the complement of the cumulative risk.

Rearranging, $h(t)dt = \frac{f(t)}{S(t)}dt$ describes the prognosis for a subject who has survived through time $t$.

$S(t)$ is also the negative exponent of the cumulative hazard function,

\[S(t) = e^{-H(t)}.\]

Taking the log and rearranging, $h(t) = \frac{d}{dt} [\log S(t)]$, the negative time derivative of the log cumulative hazard.

Use the survival function to estimate the mean survival time, $E(T) = \int S(t)dt$, and median survival time, $S(t) = 0.5$.

Take the exponential distribution as a quick example. It has a constant hazard, $h(t) = \lambda$. The cumulative hazard is $H(t) = \int_0^t \lambda du = \lambda t$. The survival function is $S(t) = e^{-\lambda t}$. The probability of failure at time $t$ is $f(t) = \lambda e^{-\lambda t}$. The expected time to failure is $E(t) = \int_0^\infty e^{-\lambda t} dt = 1 / \lambda$, and the median time to failure is $S(t) = e^{-\lambda t} = .5$, or $t_{med} = \log(2) / \lambda$.

There are parametric and non-parametric methods to estimate a survival curve. The usual non-parametric method is the Kaplan-Meier estimator. The usual parametric method is the Weibull distribution. In between is the most common way to estimate a survivor curve, the Cox proportional hazards model.

1.1 Log-Rank Test

It is not obvious how to compare two survival distributions because they can cross, diverge, etc. When observations do not follow a parametric distribution function, compare them with the non-parametric log-rank test. The alternative hypothesis, termed the Lehmann alternative, is that one survival distribution is uniformly higher than the other, $H_A : S_1(t) = [S_0(t)]^\psi$, or equivalently, the hazard functions are proportional, $h_1(t) = \psi h_0(t)$, with $H_A: \psi \ne 1$.

At each $t$, you could construct a 2x2 contingency table between event/no-event and curves A and B.

	Curve A	Curve B	Total
Event	$d_{Ai}$	$d_{Bi}$	$d_i$
No Event	$n_{Ai} - d_{0i}$	$n_{Bi} - d_{1i}$	$n_i - d_i$
Total	$n_{Ai}$	$n_{Bi}$	$n_i$

Holding the margins as fixed, the probability of observing $d_{Ai}$ events in curve A at time $i$ follows a hypergeometric distribution.

\[f(d_{Ai} | n_{Ai}, n_{Bi}, d_i) = \frac{{{n_{Ai}}\choose{d_{Ai}}}{{n_{Bi}}\choose{d_{Bi}}}}{{n_i}\choose{d_i}}\]

The expected value is $e_{Ai} = E(d_{Ai}) = \frac{d_i}{n_i} \cdot n_{0i}$ with variance $v_{Ai} = Var(d_{Ai}) = d_{i} \cdot \frac{n_{Ai}}{n_i} \cdot \frac{n_{1i}}{n_i} \cdot \frac{n_i - d_i}{n_i - 1}$.

The log-rank test statistic is the sum of the differences between the observed and expected events, $U_0 = \sum (d_{Ai} - e_{Ai})$, normalized by dividing by the square-root of its variance, $V_0 = Var({U_0}) = \sum v_{Ai}$.

\[U = \frac{U_0}{\sqrt{V_0}} \sim N(0, 1)\]

$U^2$ is a chi-square random variable with one degree of freedom.

\[U^2 = \frac{U_0^2}{V_0} \sim \chi_1^2\]

lung_1 <- survival::lung %>%
  mutate(sex = factor(sex, levels = c(1, 2), labels = c("Male", "Female")))
(km_diff <- survdiff(Surv(time, status) ~ sex, data = lung_1))

## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung_1)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=Male   138      112     91.6      4.55      10.3
## sex=Female  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

The p-value for $\chi_1^2$ = 10.3 is 1 - pchisq(km_diff$chisq, length(km_diff$n) - 1) = 0.001, so reject $H_0$ that males and females have identical survival patterns.

while the log-rank test can determine whether survival differs between groups, it does not estimate the effect size. It is a statistical, but not a clinical, assessment of the factor’s impact (Bradburn 2003).

References

Bradburn, Clark, M. J. 2003. “Survival Analysis Part II: Multivariate Data Analysis–an Introduction to Concepts and Methods.” British Journal of Cancer 89 (3). https://doi.org/10.1038/sj.bjc.6601119.

Many texts represent hazard with the $\lambda$ symbol instead of $h$.↩︎

	Curve A	Curve B	Total
Event	\(d_{Ai}\)	\(d_{Bi}\)	\(d_i\)
No Event	\(n_{Ai} - d_{0i}\)	\(n_{Bi} - d_{1i}\)	\(n_i - d_i\)
Total	\(n_{Ai}\)	\(n_{Bi}\)	\(n_i\)