2.2 Pointwise confidence interval for \(S(t)\)
To construct a confidence interval for the estimated survival function we can use a well-known estimator of the variance, the Greenwood estimator (Greenwood 1926). The Greenwood variance estimate for a Kaplan-Meier curve is defined as
\[ \hat \sigma^2[\hat S(t)] = \widehat{var}[\hat S(t)] = \hat S(t)^2 \sum_{i:t_i \le t} \frac{d_i}{n_i(n_i-d_i)} \]
In case of no censoring, this estimator reduces to
\(\hat \sigma^2[\hat S(t)] = \frac{\hat S(t) [1- \hat S(t)]}{n}\).
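As an illustration, the Greenwood formula can be evaluated by hand in base R. This is only a minimal sketch: the toy vectors `time` and `status` and the helper `greenwood_km()` are hypothetical and not part of the loan data or of the survival package. It assumes `status == 1` marks an event and `0` a censored observation.

```r
# Hypothetical toy data: observed times and event indicator (1 = event, 0 = censored)
time   <- c(2, 3, 3, 5, 7, 8, 8, 9, 11, 12)
status <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# Hypothetical helper: Kaplan-Meier estimate with Greenwood standard error
greenwood_km <- function(time, status) {
  t_i <- sort(unique(time[status == 1]))                        # distinct event times
  n_i <- sapply(t_i, function(t) sum(time >= t))                # at risk just before t_i
  d_i <- sapply(t_i, function(t) sum(time == t & status == 1))  # events at t_i
  surv   <- cumprod(1 - d_i / n_i)                              # Kaplan-Meier estimate
  gw_var <- surv^2 * cumsum(d_i / (n_i * (n_i - d_i)))          # Greenwood variance
  data.frame(time = t_i, n.risk = n_i, n.event = d_i,
             survival = surv, std.err = sqrt(gw_var))
}

greenwood_km(time, status)

# Without censoring, the Greenwood variance should equal S(1 - S) / n.
# The last event time is dropped because n_i - d_i = 0 there (S = 0).
no_cens <- greenwood_km(time, rep(1, length(time)))
keep <- no_cens$survival > 0
all.equal(no_cens$std.err[keep]^2,
          no_cens$survival[keep] * (1 - no_cens$survival[keep]) / length(time))
```

The final comparison checks numerically the reduction stated above for the case of no censoring.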
It is possible to use this estimator to derive a pointwise confidence interval at each time point \(t\). Assuming asymptotic normality, \(\hat S(t) \approx N(S(t), \hat\sigma^2)\), where \(\hat \sigma\) denotes the Greenwood standard error of \(\hat S(t)\), the confidence interval for the survival function is computed as follows (plain)
\[ \bigg(\hat S(t) \pm z_{1-\alpha/2} \cdot \hat \sigma \bigg), \] where \(\hat \sigma = se(\hat S(t))\) is calculated using Greenwood’s formula.
It is important to highlight here that this confidence interval may fall outside the (0,1) interval! To solve this, the approximation to the normal distribution is improved by using the log-minus-log transformation
\[ \bigg(\hat S(t)^{\exp\left(\pm z_{1-\alpha/2} \cdot \frac{\hat\sigma}{\hat S(t) \ln \hat S(t)}\right)} \bigg). \]
Other options include the log transformation \[ \exp \bigg( \ln(\hat S(t)) \pm z_{1-\alpha/2} \cdot \hat\sigma/ \hat S(t) \bigg). \]
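To make the three constructions concrete, the sketch below computes them from a single survival estimate and its standard error. The helper `km_ci()` is hypothetical, and it assumes that the standard error supplied is the Greenwood standard error of \(\hat S(t)\) itself (the `std.err` column reported by `summary`). The inputs are the rounded values shown in this section for time 1100, so the limits should agree with the reported ones only up to rounding.

```r
# Hypothetical helper: plain, log and log-log intervals from S(t) and its standard error
km_ci <- function(surv, se, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)

  # plain: symmetric interval on the survival scale
  plain <- surv + c(-1, 1) * z * se

  # log: interval built on log S(t), whose standard error is se / S(t)
  log_ci <- exp(log(surv) + c(-1, 1) * z * se / surv)

  # log-log: interval built on log(-log S(t)), with standard error se / (S(t) * |log S(t)|)
  se_ll  <- se / (surv * abs(log(surv)))
  loglog <- c(surv^exp(z * se_ll), surv^exp(-z * se_ll))

  out <- rbind(plain = plain, log = log_ci, `log-log` = loglog)
  colnames(out) <- c("lower", "upper")
  out
}

# Rounded values taken from the summary output at time 1100
ci <- km_ci(surv = 0.626, se = 0.01369)
round(ci, 3)
```

Only the plain interval can leave the (0,1) range. The log interval is always positive but its upper limit can exceed 1, while the log-log interval always stays inside (0,1).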
In R we can select these options with the conf.type argument as: "log" (default), "log-log" and "plain".
km1 <- survfit(Surv(time, status) ~ 1, data = loan_filtered) # conf.type = "log" (default)
summary(km1, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 200 4207 201 0.955 0.00309 0.949 0.961
## 1100 143 1130 0.626 0.01369 0.600 0.653
km2 <- survfit(Surv(time, status) ~ 1, data = loan_filtered, conf.type = "plain")
summary(km2, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered,
## conf.type = "plain")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 200 4207 201 0.955 0.00309 0.949 0.961
## 1100 143 1130 0.626 0.01369 0.599 0.653
km3 <- survfit(Surv(time, status) ~ 1, data = loan_filtered, conf.type = "log-log")
summary(km3, times = c(200, 1100))
## Call: survfit(formula = Surv(time, status) ~ 1, data = loan_filtered,
## conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 200 4207 201 0.955 0.00309 0.949 0.961
## 1100 143 1130 0.626 0.01369 0.598 0.652
Note the arguments times and censored of the function summary.survfit.
And now… what about the empirical distribution (without taking into account the censored data)? We can compare both!
References
Greenwood, M. 1926. “The Natural Duration of Cancer.” Edited by His Majesty’s Stationery Office. Reports on Public Health and Medical Subjects 33. London, United Kingdom.