A short course on Survival Analysis applied to the Financial Industry

2.2 Pointwise confidence interval for \(S(t)\)

To construct a confidence interval for the estimated survival function we can use a well-known estimator of the variance, the Greenwood estimator (Greenwood 1926). The Greenwood variance estimate for a Kaplan-Meier curve is defined as

\[ \hat \sigma^2[\hat S(t)] = \widehat{\mathrm{var}}[\hat S(t)] = \hat S(t)^2 \sum_{i:t_i \le t} \frac{d_i}{n_i(n_i-d_i)} \]
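As a rough illustration, the formula can be evaluated by hand in base R. The sketch below uses a small made-up sample (the vectors time and status are hypothetical, not the loan data), where \(n_i\) is the number at risk just before the event time \(t_i\) and \(d_i\) is the number of events at \(t_i\). It assumes the largest observation is censored, so that \(n_i - d_i\) never reaches zero.

```r
# Toy data (hypothetical): follow-up times and event indicator
# status: 1 = event (default), 0 = censored
time   <- c(2, 3, 3, 5, 6, 8, 8, 9, 11, 12)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# Distinct event times t_i
t_i <- sort(unique(time[status == 1]))

# n_i: number at risk just before t_i; d_i: number of events at t_i
n_i <- sapply(t_i, function(t) sum(time >= t))
d_i <- sapply(t_i, function(t) sum(time == t & status == 1))

# Kaplan-Meier estimate and Greenwood variance at each event time
S_hat     <- cumprod(1 - d_i / n_i)
var_green <- S_hat^2 * cumsum(d_i / (n_i * (n_i - d_i)))

greenwood <- data.frame(time = t_i, n.risk = n_i, n.event = d_i,
                        survival = S_hat, std.err = sqrt(var_green))
print(greenwood)
```

The std.err column is the square root of the Greenwood variance, which is the quantity the confidence intervals are built from.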

In case of no censoring, this estimator reduces to

\(\hat \sigma^2[\hat S(t)] = \frac{\hat S(t) [1- \hat S(t)]}{n}\).
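A short derivation shows why. It is a sketch under the assumption that, without censoring, everyone who survives \(t_i\) is still at risk at \(t_{i+1}\), so that \(n_{i+1} = n_i - d_i\) and \(n_1 = n\). Each term of the Greenwood sum then telescopes:

\[ \frac{d_i}{n_i(n_i-d_i)} = \frac{1}{n_i-d_i} - \frac{1}{n_i} = \frac{1}{n_{i+1}} - \frac{1}{n_i}. \]

Summing over \(t_i \le t\), and noting that \(n \hat S(t)\) subjects survive beyond \(t\),

\[ \sum_{i:t_i \le t} \frac{d_i}{n_i(n_i-d_i)} = \frac{1}{n \hat S(t)} - \frac{1}{n} = \frac{1-\hat S(t)}{n \hat S(t)}. \]

Multiplying by \(\hat S(t)^2\) gives \(\hat S(t)[1-\hat S(t)]/n\), the usual binomial variance of a proportion.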

This estimator can be used to derive a pointwise confidence interval at any time point \(t\). Let \(\hat \sigma\) denote the Greenwood standard error, that is, the square root of the Greenwood variance estimate. Assuming asymptotic normality, \(\hat S(t) \simeq N(S(t), \hat \sigma^2)\), the confidence interval for the survival function is computed as follows (plain)

\[ \bigg(\hat S(t) \pm z_{1-\alpha/2} \cdot \hat \sigma \bigg), \] where \(\hat \sigma = se(\hat S(t))\) is calculated using Greenwood’s formula.

It is important to highlight that this confidence interval may extend outside the (0,1) interval! To solve this, the normal approximation is applied on a transformed scale, where it tends to be more accurate. Using the log-minus-log transformation \(\ln(-\ln \hat S(t))\), the interval becomes

\[ \bigg(\hat S(t)^{\exp \big( \pm z_{1-\alpha/2} \cdot \frac{\hat\sigma}{\hat S(t) \ln \hat S(t)} \big)} \bigg). \]

Other options include the log transformation \[ \exp \bigg( \ln(\hat S(t)) \pm z_{1-\alpha/2} \cdot \hat\sigma/ \hat S(t) \bigg). \]
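The three intervals can be sketched as a small base R function. The helper km_ci() below is hypothetical (it is not part of the survival package) and takes a survival estimate and its Greenwood standard error. It is applied here to the rounded values 0.626 and 0.01369, so the limits should only be expected to agree with R's own output up to rounding in the last digit. No truncation of the plain interval to (0,1) is attempted.

```r
# Pointwise confidence limits for S(t) from an estimate and its Greenwood standard error.
# km_ci() is a hypothetical helper written for illustration only.
km_ci <- function(surv, se, conf.type = c("log", "plain", "log-log"), level = 0.95) {
  conf.type <- match.arg(conf.type)
  z <- qnorm(1 - (1 - level) / 2)
  switch(conf.type,
    "plain" = c(lower = surv - z * se, upper = surv + z * se),
    "log" = c(lower = exp(log(surv) - z * se / surv),
              upper = exp(log(surv) + z * se / surv)),
    "log-log" = {
      # Standard error of log(-log S) by the delta method;
      # log(surv) is negative, hence the absolute value
      w <- z * se / abs(surv * log(surv))
      c(lower = surv^exp(w), upper = surv^exp(-w))
    })
}

# Survival estimate and standard error, rounded to the digits usually printed
ci <- sapply(c("log", "plain", "log-log"),
             function(type) km_ci(0.626, 0.01369, type))
print(round(ci, 3))
```

Only the plain interval is symmetric around the estimate. The log and log-log intervals are symmetric on the transformed scale and the log-log limits always stay inside (0,1).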

In R, these options are selected through the conf.type argument of survfit: "log" (default), "log-log" and "plain".

km1 <- survfit(Surv(time, status) ~ 1, data = loan_filtered) # conf.type = "log" (default) 
summary(km1, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   200   4207     201    0.955 0.00309        0.949        0.961
##  1100    143    1130    0.626 0.01369        0.600        0.653

km2 <- survfit(Surv(time, status) ~ 1, data = loan_filtered, conf.type = "plain") 
summary(km2, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered, 
##     conf.type = "plain")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   200   4207     201    0.955 0.00309        0.949        0.961
##  1100    143    1130    0.626 0.01369        0.599        0.653

km3 <- survfit(Surv(time, status) ~ 1, data = loan_filtered, conf.type = "log-log") 
summary(km3, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered, 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   200   4207     201    0.955 0.00309        0.949        0.961
##  1100    143    1130    0.626 0.01369        0.598        0.652

See arguments times and censored of the function summary.survfit.

And now… what about the empirical distribution (without taking into account the censored data)? We can compare both!

With the Prosper dataset, try to compare graphically the survival function based on the empirical distribution function of the time to default with the one based on the Kaplan-Meier estimator.
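As a starting point, here is a minimal sketch of such a comparison on simulated data, not on the Prosper loans. It assumes that "without taking into account the censored data" means ignoring the censoring indicator and treating every observed time as a default; the exponential and uniform distributions and all variable names are illustrative choices. The Kaplan-Meier curve is computed by hand in base R so that the block is self-contained.

```r
set.seed(1)  # simulated data (hypothetical), not the Prosper loans
n <- 500
event_time  <- rexp(n, rate = 1 / 1000)  # latent time to default, in days
censor_time <- runif(n, 0, 1500)         # latent censoring time, in days
time   <- pmin(event_time, censor_time)  # observed follow-up time
status <- as.integer(event_time <= censor_time)  # 1 = default, 0 = censored

# Naive estimate: 1 - ECDF of all observed times, ignoring the censoring indicator
S_naive <- function(t) 1 - ecdf(time)(t)

# Kaplan-Meier estimate as a right-continuous step function
t_i <- sort(unique(time[status == 1]))
n_i <- sapply(t_i, function(t) sum(time >= t))
d_i <- sapply(t_i, function(t) sum(time == t & status == 1))
S_km <- stepfun(t_i, c(1, cumprod(1 - d_i / n_i)))

# Compare both estimates with the true survival function of the simulation
grid <- c(250, 500, 1000)
comparison <- cbind(time = grid, naive = S_naive(grid), km = S_km(grid),
                    true = exp(-grid / 1000))
print(round(comparison, 3))

# Draw both curves (guarded so the script also runs in non-interactive sessions)
if (interactive()) {
  plot(S_km, do.points = FALSE, ylim = c(0, 1),
       xlab = "Time", ylab = "Survival", main = "")
  curve(S_naive(x), add = TRUE, lty = 2)
  legend("topright", legend = c("Kaplan-Meier", "1 - ECDF"), lty = c(1, 2))
}
```

Because the naive curve counts censored loans as defaults, it can never lie above the Kaplan-Meier curve, and the gap widens as censoring accumulates.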

References

Greenwood, M. 1926. “The Natural Duration of Cancer.” Reports on Public Health and Medical Subjects 33. London, United Kingdom: His Majesty’s Stationery Office.